API

Solvers

class lrl.solvers.PolicyIteration(env, value_function_initial_value=0.0, max_policy_eval_iters_per_improvement=10, policy_evaluation_type='on-policy-iterative', **kwargs)

Bases: lrl.solvers.base_solver.BaseSolver

Solver for policy iteration

Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 80).

Notes

See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

Examples

See examples directory

Parameters
  • value_function_initial_value (float) – Value to initialize all elements of the value function to

  • max_policy_eval_iters_per_improvement (int) – Maximum number of policy evaluation iterations performed per policy improvement step (passed to _policy_evaluation() as max_iters)

  • policy_evaluation_type (str) – Type of solution method for calculating the policy (see _policy_evaluation() for more details). Typical usage should not need to change this, as doing so can make calculations slower and more memory intensive

  • See the BaseSolver class for additional parameters

Returns

None
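
A minimal usage sketch. The track name and parameter values below are illustrative (see racetrack_tracks.py in the lrl source for the tracks actually shipped); any instanced lrl environment can be used - see the Environments section and the examples directory.

from lrl.environments import Racetrack
from lrl.solvers import PolicyIteration

# Any instanced lrl environment works here; the track name is illustrative
env = Racetrack(track='10x10')

pi = PolicyIteration(env, gamma=0.9, max_iters=50)
pi.iterate_to_convergence()         # evaluate/improve until the policy stops changing

stats = pi.score_policy(iters=100)  # 100 greedy episodes under the converged policy
print(stats.get_statistics())       # reward/step statistics (see EpisodeStatistics)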

value = None

Space-efficient dict-like storage of the current and all former value functions

Type

DictWithHistory

iterate()

Perform a single iteration of policy iteration, updating self.value and storing metadata about the iteration.

Side Effects:

  • self.value: Updated to the newest estimate of the value function

  • self.policy: Updated to the greedy policy according to the value function estimate

  • self.iteration: Increment iteration counter by 1

  • self.iteration_data: Add new record to iteration data store

Returns

None

converged()

Returns True if solver is converged.

Judge convergence by checking whether the most recent policy iteration resulted in any changes in policy

Returns

Convergence status (True=converged)

Return type

bool

_policy_evaluation(max_iters=None)

Compute an estimate of the value function for the current policy to within self.tolerance

Side Effects:

self.value: Updated to the newest estimate of the value function

Returns

None

_policy_improvement(return_differences=True)

Update the policy to be greedy relative to the most recent value function

Side Effects:

self.policy: Updated to be greedy relative to self.value

Parameters

return_differences – If True, return number of differences between old and new policies

Returns

(if return_differences==True) Number of differences between the old and new policies

Return type

int

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)
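
A short sketch of consuming this return value (assuming solver is any solved BaseSolver subclass):

states, rewards, is_terminal = solver.run_policy(max_steps=100)

print(len(states), 'states visited')        # includes the starting state
print(sum(rewards), 'total reward')         # rewards[0] == 0 (step 0 only enters the start state)
print('terminated naturally:', is_terminal)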

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics
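
For example (the initial_state value is illustrative and its format depends on the environment):

stats = solver.score_policy(iters=200, initial_state=(2, 3, 0, 0))  # illustrative state tuple
print(stats.get_statistic('reward_mean'))   # mean reward over the 200 episodes
print(stats.get_statistic('steps_mean'))    # mean episode length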

class lrl.solvers.ValueIteration(env, value_function_initial_value=0.0, **kwargs)

Bases: lrl.solvers.base_solver.BaseSolver

Solver for value iteration

Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 82).

Notes

See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

Examples

See examples directory

Parameters
  • value_function_initial_value (float) – Value to initialize all elements of the value function to

  • See the BaseSolver class for additional parameters

Returns

None
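
A minimal usage sketch (reusing an env instanced as in the PolicyIteration sketch above):

from lrl.solvers import ValueIteration

vi = ValueIteration(env, gamma=0.9, value_function_tolerance=0.001)
vi.iterate_to_convergence()

# The value function history is kept in a DictWithHistory (see Data Stores)
print(vi.value.to_dict())                       # value function at the final timepoint
print(vi.iteration_data.to_dataframe().tail())  # per-iteration convergence data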

value = None

Space-efficient dict-like storage of the current and all former value functions

Type

DictWithHistory

iterate()

Perform a single iteration of value iteration, updating self.value and storing metadata about the iteration.

Side Effects:

  • self.value: Updated to the newest estimate of the value function

  • self.policy: Updated to the greedy policy according to the value function estimate

  • self.iteration: Increment iteration counter by 1

  • self.iteration_data: Add new record to iteration data store

Returns

None

converged()

Returns True if solver is converged.

Test convergence by comparing the latest value function delta_max to the convergence tolerance

Returns

Convergence status (True=converged)

Return type

bool

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics

class lrl.solvers.QLearning(env, value_function_tolerance=0.1, alpha=None, epsilon=None, max_iters=2000, min_iters=250, num_episodes_for_convergence=20, **kwargs)

Bases: lrl.solvers.base_solver.BaseSolver

Solver class for Q-Learning

Notes

See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

Examples

See examples directory

Parameters
  • alpha (float, dict) –

    (OPTIONAL)

    • If None, default linear decay schedule applied, decaying from 0.1 at iter 0 to 0.025 at max iter

    • If float, interpreted as a constant alpha value

    • If dict, interpreted as specifications to a decay function as defined in decay_functions()

  • epsilon (float, dict) –

    (OPTIONAL)

    • If None, default linear decay schedule applied, decaying from 0.25 at iter 0 to 0.05 at max iter

    • If float, interpreted as a constant epsilon value

    • If dict, interpreted as specifications to a decay function as defined in decay_functions()

  • num_episodes_for_convergence (int) – Number of consecutive episodes with delta_Q < tolerance to say a solution is converged

  • **kwargs – Other arguments passed to BaseSolver

Returns

None
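
A minimal usage sketch (constant alpha/epsilon are used for simplicity; env is any instanced lrl environment, as in the sketches above):

from lrl.solvers import QLearning

ql = QLearning(env, alpha=0.1, epsilon=0.1, max_iters=2000,
               num_episodes_for_convergence=20)
ql.iterate_to_convergence(raise_if_not_converged=False)

print(ql.transitions)                          # total transitions experienced during learning
print(ql.episode_statistics.get_statistics())  # statistics over the training episodes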

transitions = None

Counter for number of transitions experienced during all learning

Type

int

q = None

Space-efficient dict-like storage of the current and all former q functions

Type

DictWithHistory

iteration_data = None

Data store for iteration data

Overloads BaseSolver’s iteration_data attribute with one that includes more fields

Type

GeneralIterationData

episode_statistics = None

Data store for statistics from training episodes

Type

EpisodeStatistics

num_episodes_for_convergence = None

Number of consecutive episodes with delta_Q < tolerance to say a solution is converged

Type

int

_policy_improvement(states=None)

Update the policy to be greedy relative to the most recent q function

Side Effects:

self.policy: Updated to be greedy relative to self.q

Parameters

states – List of states to update. If None, all states will be updated

Returns

None

step(count_transition=True)

Take and learn from a single step in the environment.

Applies the typical Q-Learning approach to learn from the experienced transition

Parameters

count_transition (bool) – If True, increment transitions counter self.transitions. Else, do not.

Returns

tuple containing:

  • transition (tuple): Tuple of (state, reward, next_state, is_terminal)

  • delta_q (float): The (absolute) change in q caused by this step

Return type

(tuple)

iterate()

Perform and learn from a single episode in the environment (one walk from start to finish)

Side Effects:

  • self.value: Updated to the newest estimate of the value function

  • self.policy: Updated to the greedy policy according to the value function estimate

  • self.iteration: Increment iteration counter by 1

  • self.iteration_data: Add new record to iteration data store

  • self.env: Reset and then walked through

Returns

None

choose_epsilon_greedy_action(state, epsilon=None)

Return an action chosen by epsilon-greedy scheme based on the current estimate of Q

Parameters
  • state (int, tuple) – Descriptor of current state in environment

  • epsilon – Optional. If None, self.epsilon is used

Returns

action chosen

Return type

int or tuple

converged()

Returns True if solver is converged.

Returns

Convergence status (True=converged)

Return type

bool

get_q_at_state(state)

Returns a numpy array of q values at the current state, in the same order as the standard action indexing

Parameters

state (int, tuple) – Descriptor of current state in environment

Returns

Numpy array of q for all actions

Return type

np.array

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

init_q(init_val=0.0)

Initialize self.q, a dict-like DictWithHistory object for storing the state-action value function q

Parameters

init_val (float) – Value to give all states in the initialized q

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics

property alpha

Returns value of alpha at current iteration

property epsilon

Returns value of epsilon at current iteration

class lrl.solvers.BaseSolver(env, gamma=0.9, value_function_tolerance=0.001, policy_init_mode='zeros', max_iters=500, min_iters=2, max_steps_per_episode=100, score_while_training=False, raise_if_not_converged=False)

Bases: object

Base class for solvers

Examples

See examples directory

Parameters
  • env – Environment instance, such as from RaceTrack() or RewardingFrozenLake()

  • gamma (float) – Discount factor

  • value_function_tolerance (float) – Tolerance for convergence of the value function during solving (also used as the tolerance for the Q (state-action) value function)

  • policy_init_mode (str) – Initialization mode for policy. See init_policy() for more detail

  • max_iters (int) – Maximum number of iterations to solve environment

  • min_iters (int) – Minimum number of iterations before checking for solver convergence

  • raise_if_not_converged (bool) – If True, will raise exception when environment hits max_iters without convergence. If False, a warning will be logged.

  • max_steps_per_episode (int) – Maximum number of steps allowed per episode (helps when evaluating policies that can lead to infinite walks)

  • score_while_training (dict, bool) –

    Dict specifying whether the policy should be scored during training (eg: test how well a policy is doing every N iterations).

    If dict, must be of format:

    • n_trains_per_eval (int): Number of training iters between evaluations

    • n_evals (int): Number of episodes for a given policy evaluation

    If True, score with default settings of:

    • n_trains_per_eval: 500

    • n_evals: 500

    If False, do not score during training.

Returns

None
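
The score_while_training dict format described above, sketched with one of the concrete solvers (values are illustrative; env is any instanced lrl environment):

from lrl.solvers import ValueIteration

# Score the current policy every 100 training iterations, using 250 episodes per evaluation
solver = ValueIteration(env, gamma=0.9,
                        score_while_training={'n_trains_per_eval': 100, 'n_evals': 250})
solver.iterate_to_convergence()

print(solver.scoring_summary.to_dataframe())   # one row per intermediate scoring run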

env = None

Environment being solved

Type

Racetrack, RewardingFrozenLakeEnv

policy = None

Space-efficient dict-like storage of the current and all former policies.

Type

DictWithHistory

iteration_data = None

Data describing iteration results during solving of the environment.

Fields include:

  • time: time for this iteration

  • delta_max: maximum change in value function for this iteration

  • policy_changes: number of policy changes this iteration

  • converged: boolean denoting if solution is converged after this iteration

Type

GeneralIterationData

scoring_summary = None

Summary data from scoring runs computed during training if score_while_training == True

Fields include:

  • reward_mean: mean reward obtained during a given scoring run

Type

GeneralIterationData

scoring_episode_statistics = None

Detailed scoring data from scoring runs held as a dict of EpisodeStatistics objects.

Data is indexed by iteration number (from scoring_summary)

Type

dict, EpisodeStatistics

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

iterate()

Perform a single iteration of the solver.

This may be an iteration through all states in the environment (like in policy iteration) or obtaining and learning from a single experience (like in Q-Learning)

This method should update self.value and may update self.policy, and also commit iteration statistics to self.iteration_data. Unless the subclass implements a custom self.converged, self.iteration_data should include a boolean entry for “converged”, which is used by the default converged() function.

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

converged()

Returns True if solver is converged.

This may be custom for each solver, but as a default it checks whether the most recent iteration_data entry has converged==True

Returns

Convergence status (True=converged)

Return type

bool

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics

Environments

class lrl.environments.Racetrack(track=None, x_vel_limits=None, y_vel_limits=None, x_accel_limits=None, y_accel_limits=None, max_total_accel=2)

Bases: gym.envs.toy_text.discrete.DiscreteEnv

A car-race-like environment that uses location and velocity for state and acceleration for actions, in 2D

Loosely inspired by the Racetrack example of Sutton and Barto’s Reinforcement Learning (Exercise 5.8, http://www.incompleteideas.net/book/the-book.html)

The objective of this environment is to traverse a racetrack from a start location to any goal location. Reaching a goal location returns a large reward and terminates the episode, whereas landing on a grass location returns a large negative reward and terminates the episode. All non-terminal transitions return a small negative reward. Oily road surfaces are non-terminal but also react to an agent’s action stochastically, sometimes causing an Agent to “slip” whereby their requested action is ignored (interpreted as if a=(0,0)).

The tiles in the environment are:

  • (blank): Clean open (deterministic) road

  • O: Oily (stochastic) road

  • G: (terminal) grass

  • S: Starting location (agent starts at a random starting location). After starting, S tiles behave like open road

  • F: Finish location(s) (agent must reach any of these tiles to receive a positive reward)

The state space of the environment is described by xy location and xy velocity (with maximum velocity being a user-specified parameter). For example, s=(3, 5, 1, -1) means the Agent is currently in the x=3, y=5 location with Vx=1, Vy=-1.

The action space of the environment is xy acceleration (with maximum acceleration being a user-specified parameter). For example, a=(-2, 1) means ax=-2, ay=1. Transitions are determined by the current velocity as well as the requested acceleration (with a cap set by Vmax of the environment), for example:

  • s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(1, 5, -2, 0)

But if vx_max == +-1 then:

  • s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(2, 5, -1, 0)

Note that sign conventions for location are:

  • x: 0 at leftmost column, positive to the right

  • y: 0 at bottommost row, positive up

Parameters
  • track (list) – List of strings describing the track (see racetrack_tracks.py for examples)

  • x_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in x. Default is (-2, 2).

  • y_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in y. Default is (-2, 2).

  • x_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in x. Default is (-2, 2).

  • y_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in y. Default is (-2, 2).

  • max_total_accel (int) – (OPTIONAL) Integer maximum total acceleration in one action, computed as abs(x_a) + abs(y_a) (the sum of the magnitudes of acceleration in both directions). Default is 2.

Notes

See also discrete.DiscreteEnv for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

DOCTODO: Add examples
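
A brief sketch (the track name is illustrative; see racetrack_tracks.py for the tracks shipped with lrl):

from lrl.environments import Racetrack

rt = Racetrack(track='10x10', x_vel_limits=(-2, 2), y_vel_limits=(-2, 2),
               max_total_accel=2)

rt.reset()        # place the agent at a random starting (S) location
rt.render()       # print the track, marking the current location with '*'
rt.step((1, 0))   # request ax=+1, ay=0 (an integer action index is also accepted)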

track = None

List of strings describing track or the string name of a default track

Type

list, str

desc = None

Numpy character array of the track (better for printing on screen/accessing track at xy locations)

Type

np.array

color_map = None

Map from grid tile type to display color

Type

dict

index_to_state = None

Attribute to map from state index to full tuple describing state

Ex: index_to_state[state_index] -> state_tuple

Type

list

state_to_index = None

Attribute to map from state tuple to state index

Ex: state_to_index[state_tuple] -> state_index

Type

dict

is_location_terminal = None

Attribute to map whether a state is terminal (eg: no rewards/transitions leading out of the state). Keyed by state tuple

Type

dict

s = None

Current state (inherited from parent)

Type

int, tuple

reset()

Reset the environment to a random starting location

Returns

None

render(mode='human', current_location='*')

Render the environment.

Warning

This method does not follow the prototype of its parent. It is presently a very simple version for printing the environment’s current state to the screen

Parameters
  • mode – (NOT USED)

  • current_location – Character to denote the current location

Returns

None

step(a)

Take a step in the environment.

This wraps the parent object’s step(), interpreting integer actions as mapped to human-readable actions

Parameters

a (tuple, int) – Action to take, either as an integer (0..nA-1) or true action (tuple of (x_accel,y_accel))

Returns

Next state, either as a tuple or int depending on type of state used

close()

Override _close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

Returns the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

list<bigint>

property unwrapped

Completely unwrap this env.

Returns

The base non-wrapped gym.Env instance

Return type

gym.Env

Experiment Runners

lrl.utils.experiment_runners.run_experiment(env, params, output_path)

Run a single experiment (env/solver combination), outputting results to a given location

FUTURE: Improve reproducibility by outputting a settings file or similar? Could use gin-config or just output params, although outputting params doesn’t cover env…

Parameters
  • env – An instanced environment object (eg: Racetrack() or RewardingFrozenLake())

  • params – A dictionary of solver parameters for this run

  • output_path (str) – Path to output data (plots and csvs)

Output to output_path:

  • iteration_data.csv: Data about each solver iteration (shows how long each iteration took, how quickly the solver converged, etc.)

  • solver_results*.png: Images of policy (and value for planners). If environment state is defined by xy alone, a single image is returned. Else, an image for each additional state is returned (eg: for state = (x, y, vx, vy), plots of solver_results_vx_vy.png are returned for each (vx, vy))

  • scored_episodes.csv and scored_episodes.png: Detailed data for each episode taken during the final scoring, and a composite image of those episodes in the environment

  • intermediate_scoring_results.csv: Summary data from each evaluation during training (shows history of how the solver improved over time)

  • intermediate_scoring_results_*.png: Composite images of the intermediate scoring results taken during training, indexed by the iteration at which they were produced

  • training_episodes.csv and training_episodes.png: Detailed data for each episode taken during training, and a composite image of those episodes exploring the environment (only available for an exploratory learner like Q-Learning)

Returns

dict containing:

  • solver (BaseSolver, ValueIteration, PolicyIteration, QLearning): Fully populated solver object (after solving env)

  • scored_results (EpisodeStatistics): EpisodeStatistics object of results from scoring the final policy

  • solve_time (float): Time in seconds used to solve the env (eg: run solver.iterate_to_convergence())

Return type

(dict)
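
A usage sketch. The params keys shown are illustrative only (the exact keys accepted depend on the solver; see the examples directory for complete parameter dictionaries), and the track name is illustrative:

from lrl.environments import Racetrack
from lrl.utils.experiment_runners import run_experiment

params = {'solver': 'PolicyIteration', 'gamma': 0.9, 'max_iters': 500}  # illustrative keys

results = run_experiment(env=Racetrack(track='10x10'),
                         params=params,
                         output_path='./output/pi_racetrack/')
print(results['solve_time'])   # seconds spent solving the environment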

lrl.utils.experiment_runners.run_experiments(environments, solver_param_grid, output_path='./output/')

Runs a set of experiments defined by param_grid, writing results to output_path

Parameters
  • environments (list) – List of instanced environments

  • solver_param_grid (dict) – Solver parameters in suitable form for sklearn.model_selection.ParameterGrid

  • output_path (str) – Relative path to which results will be output

Output to output_path:

  • For each environment:

    • env_name/grid_search_summary.csv: high-level summary of results for this env

    • env_name/case_name: Directory with detailed results for each env/case combination (see run_experiment() for details on the casewise output)

Returns

None
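
A usage sketch. The grid follows sklearn.model_selection.ParameterGrid conventions (each key maps to a list of candidate values and the cross-product of settings is run); the keys and track name below are illustrative:

from lrl.environments import Racetrack
from lrl.utils.experiment_runners import run_experiments

param_grid = {'solver': ['ValueIteration', 'PolicyIteration'],   # illustrative keys
              'gamma': [0.8, 0.9]}

run_experiments(environments=[Racetrack(track='10x10')],   # list of instanced environments
                solver_param_grid=param_grid,
                output_path='./output/grid_search/')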

Plotting

lrl.utils.plotting.plot_solver_convergence(solver, **kwargs)

Convenience binding to plot convergence statistics for a solver object.

Also useful as a recipe for custom plotting.

Parameters
  • solver (BaseSolver (or child)) – Solver object to be plotted

  • **kwargs – Other arguments; see plot_solver_convergence_from_df()

Returns

Matplotlib axes object

Return type

Axes

lrl.utils.plotting.plot_solver_convergence_from_df(df, y='delta_max', y_label=None, x='iteration', x_label='Iteration', label=None, ax=None, savefig=None, **kwargs)

Convenience binding to plot convergence statistics for a set of solver objects.

Also useful as a recipe for custom plotting.

Parameters
  • df (pandas.DataFrame) – DataFrame with solver convergence data

  • y (str) – Convergence statistic to be plotted (eg: delta_max, delta_mean, time, or policy_changes)

  • y_label (str) – Optional label for y_axis (if omitted, will use y as default name unless axis is already labeled)

  • x (str) – X axis data (typically ‘iteration’, but could be any convergence data)

  • x_label (str) – Optional label for x_axis (if omitted, will use ‘Iteration’)

  • label (str) – Optional label for the data set (shows up in axes legend)

  • ax (Axes) – Optional Matplotlib Axes object to add this line to

  • savefig (str) – Optional filename to save the figure to

  • kwargs – Additional args passed to matplotlib’s plot

Returns

Matplotlib axes object

Return type

Axes

lrl.utils.plotting.plot_env(env, ax=None, edgecolor='k', resize_figure=True, savefig=None)

Plot the map of an environment

Parameters
  • env – Environment to plot

  • ax (axes) – (Optional) Axes object to plot on

  • edgecolor (str) – Color of the edge of each grid square (matplotlib format)

  • resize_figure (bool) –

    If true, resize the figure to:

    • width = 0.5 * n_cols inches

    • height = 0.5 * n_rows inches

  • savefig (str) – If not None, save the figure to this filename

Returns

Matplotlib axes object

Return type

Axes

lrl.utils.plotting.plot_solver_results(env, solver=None, policy=None, value=None, savefig=None, **kwargs)

Convenience function to plot results from a solver over the environment map

Input can be either a BaseSolver (or child) object, or a policy and/or value specified directly via dict or DictWithHistory.

See plot_solver_result() for more info on generation of individual plots and additional arguments for color/precision.

Parameters
  • env – Augmented OpenAI Gym-like environment object

  • solver (BaseSolver) – Solver object used to solve the environment

  • policy (dict, DictWithHistory) – Policy for the environment, keyed by integer state-index or tuples of state

  • value (dict, DictWithHistory) – Value function for the environment, keyed by integer state-index or tuples of state

  • savefig (str) – If not None, save figures to this name. For cases with multiple policies per grid square, this will be the suffix on the name (eg: for policy at Vx=1, Vy=2, we get name of savefig_1_2.png)

  • **kwargs (dict) – Other arguments passed to plot_solver_result

Returns

list of Matplotlib Axes for the plots

Return type

list
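
A usage sketch (assuming solver is a solved BaseSolver subclass for env, as in the Solvers section):

from lrl.utils import plotting

# Pass a solver object directly...
axes = plotting.plot_solver_results(env, solver=solver, savefig='solver_results')

# ...or pass policy/value explicitly (dict or DictWithHistory)
axes = plotting.plot_solver_results(env, policy=solver.policy, value=solver.value)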

lrl.utils.plotting.plot_policy(env, policy, **kwargs)

Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail

lrl.utils.plotting.plot_value(env, value, **kwargs)

Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail

lrl.utils.plotting.plot_solver_result(env, policy=None, value=None, ax=None, add_env_to_plot=True, hide_terminal_locations=True, color='k', title=None, savefig=None, size_policy='auto', size_value='auto', value_precision=2)

Plot the result for a single xy map using numpy arrays of policy and/or value shaped like env.desc

Parameters
  • env (Racetrack, FrozenLake, other environment) – Instantiated environment object

  • policy (np.array) – Policy for each grid square in the environment, in the same shape as env.desc. For environments that have multiple states per grid square (eg: Racetrack), plotting is called for each additional state (eg: for v=(0, 0), v=(1, 0), ..)

  • value (np.array) – Value for each grid square in the environment, in the same shape as env.desc. For environments that have multiple states per grid square (eg: Racetrack), plotting is called for each additional state (eg: for v=(0, 0), v=(1, 0), ..)

  • ax (Axes) – (OPTIONAL) Matplotlib axes object to plot to

  • add_env_to_plot (bool) – If True, add the environment map to the axes before plotting policy using plot_env()

  • hide_terminal_locations (bool) – If True, all known terminal locations will have no text printed (as policy here doesn’t matter)

  • color (str) – Matplotlib color string denoting color of the text for policy/value

  • title (str) – (Optional) title added to the axes object

  • savefig (str) – (Optional) string filename to output the figure to

  • size_policy (str, numeric) –

    (Optional) Specification of text font size for policy printing. One of:

    • ’auto’: Will automatically choose a font size based on the number of characters to be printed

    • str or numeric: Interpreted as a Matplotlib style font size designation

  • size_value (str, numeric) – (Optional) Specification of text font size for value printing. Same interface as size_policy

  • value_precision (int) – Precision of value function to be included on figures

Returns

Matplotlib Axes object

lrl.utils.plotting.plot_episodes(episodes, env=None, add_env_to_plot=True, max_episodes=100, alpha=None, color='k', title=None, ax=None, savefig=None)

Plot a list of episodes through an environment over a drawing of the environment

Parameters
  • episodes (list, EpisodeStatistics) – Series of episodes to be plotted. If EpisodeStatistics instance, .episodes will be extracted

  • env – Environment traversed

  • add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image

  • alpha (float) – (Optional) alpha (transparency) used for plotting the episode. If left as None, a value will be chosen based on the number of episodes to be plotted

  • color (str) – Matplotlib-style color designation

  • title (str) – (Optional) Title to be added to the axes

  • ax (axes) – (Optional) Matplotlib axes object to write the plot to

  • savefig (str) – (Optional) string filename to output the figure to

  • max_episodes (int) – Maximum number of episodes to add to the plot. If len(episodes) exceeds this value, randomly chosen episodes will be used

Returns

Matplotlib Axes object with episodes plotted to it

lrl.utils.plotting.plot_episode(episode, env=None, add_env_to_plot=True, alpha=None, color='k', title=None, ax=None, savefig=None)

Plot a single episode (walk) through the environment

Parameters
  • episode (list) – List of states encountered in the episode

  • env – Environment traversed

  • add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image

  • alpha (float) – (Optional) alpha (transparency) used for plotting the episode.

  • color (str) – Matplotlib-style color designation

  • title (str) – (Optional) Title to be added to the axes

  • ax (axes) – (Optional) Matplotlib axes object to write the plot to

  • savefig (str) – (Optional) string filename to output the figure to

Returns

Matplotlib Axes object with a single episode plotted to it

lrl.utils.plotting.choose_text_size(n_chars, boxsize=1.0)

Helper to choose an appropriate text size when plotting policies. Size is chosen based on length of text

Return is calibrated to something that typically looked nice in testing

Parameters
  • n_chars – Number of characters in the text caption to be added to the plot

  • boxsize (float) – Size of box inside which text should print nicely. Used as a scaling factor. Default is 1 inch

Returns

Matplotlib-style text size argument

lrl.utils.plotting.policy_dict_to_array(env, policy_dict)

Convert a policy stored as a dictionary into a dictionary of one or more policy numpy arrays shaped like env.desc

Can also be used for a value_dict.

policy_dict is a dictionary relating state to policy at that state in one of several forms. The dictionary can be keyed by state-index or a tuple of state (eg: (x, y, [other_state]), with x=0 in left column, y=0 in bottom row). If using tuples of state, state may be more than just x,y location as shown above, eg: (x, y, v_x, v_y). If len(state_tuple) > 2, we must plot each additional state separately.

Translate policy_dict into a policy_list_of_tuples of:

[(other_state_0, array_of_policy_at_other_state_0),
 (other_state_1, array_of_policy_at_other_state_1),
  ... ]

where the array_of_policy_at_other_state_* is in the same shape as env.desc (eg: cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).

Examples

If state is described by tuples of (x, y) (where there is a single unique state for each grid location), eg:

policy_dict = {
    (0, 0): policy_0_0,
    (0, 1): policy_0_1,
    (0, 2): policy_0_2,
    ...
    (1, 0): policy_1_0,
    (1, 1): policy_1_1,
    ...
    (xmax, ymax): policy_xmax_ymax,
    }

then a single-element list is returned of the form:

returned = [
  (None, np_array_of_policy),
]

where np_array_of_policy is of the same shape as env.desc (eg: the map), with each element corresponding to the policy at that grid location (for example, cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).

If state is described by tuples of (x, y, something_else, [more_something_else…]), for example if state = (x, y, Vx, Vy) like below:

policy_dict = {
    (0, 0, 0, 0): policy_0_0_0_0,
    (0, 0, 1, 0): policy_0_0_1_0,
    (0, 0, 0, 1): policy_0_0_0_1,
    ...
    (1, 0, 0, 0): policy_1_0_0_0,
    (1, 0, 0, 1): policy_1_0_0_1,
    ...
    (xmax, ymax, Vxmax, Vymax): policy_xmax_ymax_Vxmax_Vymax,
    }

then a list is returned of the form:

returned = [
#   (other_state, np_array_of_policies_for_this_other_state)
    ((0, 0), np_array_of_policies_with_Vx-0_Vy-0),
    ((1, 0), np_array_of_policies_with_Vx-1_Vy-0),
    ((0, 1), np_array_of_policies_with_Vx-0_Vy-1),
    ...
    ((Vxmax, Vymax), np_array_of_policies_with_Vxmax_Vymax),
]

where each element corresponds to a different combination of all the non-location state. This means that each element of the list is:

(Identification_of_this_case, shaped_xy-grid_of_policies_for_this_case)

and can be easily plotted over the environment’s map.

If policy_dict is keyed by state-index rather than state directly, the same logic as above still applies.

Notes

If using an environment (with policy keyed by either index or state) that has more than one unique state per grid location (eg: state has more than (x, y)), then environment must also have an index_to_state attribute to identify overlapping states. This constraint exists both for policies keyed by index or state, but the code could be refactored to avoid this limitation for state-keyed policies if required.

Parameters
  • env – Augmented OpenAI Gym-like environment object

  • policy_dict (dict) – Dictionary of policy for the environment, keyed by integer state-index or tuples of state

Returns

list of (description, shaped_policy) elements as described above

lrl.utils.plotting.get_ax(ax)

Returns figure and axes objects associated with an axes, instantiating if input is None

Data Stores

class lrl.data_stores.GeneralIterationData(columns=None)

Bases: object

Class to store data about solver iterations

Data is stored as a list of dictionaries. This is a placeholder for more advanced storage. Class gives a minimal set of extra bindings for convenience.

The present object has no checks to ensure consistency between added records (all have same fields, etc.). If any columns are missing from an added record, Pandas will treat them as missing data when outputting to a DataFrame.

Parameters

columns (list) – An optional list of column names for the data (if specified, this sets the order of the columns in any output Pandas DataFrame or csv)

DOCTODO: Add example of usage
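
A brief usage sketch (the column names and values are illustrative):

from lrl.data_stores import GeneralIterationData

data = GeneralIterationData(columns=['iteration', 'delta_max', 'converged'])
data.add({'iteration': 0, 'delta_max': 1.3, 'converged': False})
data.add({'iteration': 1, 'delta_max': 0.0005, 'converged': True})

print(data.get(-1)['delta_max'])   # most recently added record
df = data.to_dataframe()           # columns ordered as specified above
data.to_csv('iteration_data.csv')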

columns = None

Column names used for data output.

If specified, this sets the order of any columns being output to Pandas DataFrame or csv

Type

list

data = None

List of dictionaries representing records.

Intended to be internal in future, but public at present to give easy access to records for slicing

Type

list

add(d)

Add a dictionary record to the data structure.

Parameters

d (dict) – Dictionary of data to be stored

Returns

None

get(i=-1)

Return the ith entry in the data store (index of storage is in order in which data is committed to this object)

Parameters

i (int) – Index of data to return (can be any valid list index, including -1 and slices)

Returns

ith entry in the data store

Return type

dict

to_dataframe()

Returns the data structure as a Pandas DataFrame

Returns

Pandas DataFrame of the data

Return type

dataframe

to_csv(filename, **kwargs)

Write data structure to a csv via the Pandas DataFrame

Parameters
  • filename (str) – Filename or full path to output data to

  • kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()

Returns

None

class lrl.data_stores.DictWithHistory(timepoint_mode='explicit', tolerance=1e-07)

Bases: collections.abc.MutableMapping

Dictionary-like object that maintains a history of all changes, either incrementally or at set timepoints

This object has access like a dictionary, but stores data internally such that the user can later recreate the state of the data from a past timepoint.

The intended use of this object is to store large objects which are iterated on (such as value or policy functions) in a way that a history of changes can be reproduced without having to store a new copy of the object every time. For example, when doing 10000 episodes of Q-Learning in a grid world with 2500 states, we can retain the full policy history during convergence (eg: answer “what was my policy after episode 527”) without keeping 10000 copies of a nearly-identical 2500 element numpy array or dict. The cost for this is some computation, although this generally has not been seen to be too significant (~10’s of seconds for a large Q-Learning problem in testing)

Parameters
  • timepoint_mode (str) – One of:

    • explicit: Timepoint incrementing is handled explicitly by the user (the timepoint only changes when the user invokes .increment_timepoint())

    • implicit: Timepoint incrementing is automatic and occurs on every setting action, including redundant sets (setting a key to a value it already holds). This is useful for a time history of all sets to the object

  • tolerance (float) – Absolute tolerance to test for when replacing values. If a value to be set is less than tolerance different from the current value, the current value is not changed.

Warning

  • Deletion of keys is not specifically supported. Deletion likely works for the most recent timepoint, but the history does not handle deleted keys properly

  • Numeric data may work best due to how new values are compared to existing data, although tuples have also been tested. See __setitem__ for more detail

DOCTODO: Add example
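
A brief usage sketch:

from lrl.data_stores import DictWithHistory

d = DictWithHistory(timepoint_mode='explicit')
d['a'] = 1.0               # written at timepoint 0
d.increment_timepoint()    # later writes go to timepoint 1
d['a'] = 2.0
d['a'] = 2.0               # within tolerance of the current value -> history unchanged

print(d['a'])                    # most recent value: 2.0
print(d.get_value_history('a'))  # (timepoint, value) pairs, eg: [(0, 1.0), (1, 2.0)]
print(d.to_dict(timepoint=0))    # state of the data at timepoint 0: {'a': 1.0}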

timepoint_mode = None

See Parameters for definition

Type

str

current_timepoint = None

Timepoint that will be written to next

Type

int

__getitem__(key)

Return the most recent value for key

Returns

Whatever is contained in ._data[key][-1][-1] (return only the value from the most recent timepoint, not the timepoint associated with it)

__setitem__(key, value)

Set the value at a key if it is different from the current data stored at key

Data stored here is stored under the self.current_timepoint.

Difference between new and current values is assessed by testing:

  • new_value == old_value

  • np.isclose(new_value, old_value)

where if neither returns True, the new value is taken to be different from the current value

Side Effects:

If timepoint_mode == ‘implicit’, self.current_timepoint will be incremented after setting data

Parameters
  • key (immutable) – Key under which data is stored

  • value – Value to store at key

Returns

None

update(d)

Update this instance with a dictionary of data, d (similar to dict.update())

Keys in d that are present in this object overwrite the previous value. Keys in d that are missing in this object are added.

All data written from d is given the same timepoint (even if timepoint_mode=implicit) - the addition is treated as a single update to the object rather than a series of updates.

Parameters

d (dict) – Dictionary of data to be added here

Returns

None

get_value_history(key)

Returns a list of tuples of the value at a given key over the entire history of that key

Parameters

key (immutable) – Any valid dictionary key

Returns

list containing tuples of:

  • timepoint (int): Integer timepoint for this value

  • value (float): The value of key at the corresponding timepoint

Return type

(list)

get_value_at_timepoint(key, timepoint=-1)

Returns the value corresponding to a key at the timepoint that is closest to but not greater than timepoint

Raises a KeyError if key did not exist at timepoint. Raises an IndexError if no timepoint exists that applies

Parameters
  • key (immutable) – Any valid dictionary key

  • timepoint (int) – Integer timepoint to return value for. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)

Returns

Value corresponding to key at the timepoint closest to but not over timepoint

Return type

numeric

to_dict(timepoint=-1)

Return the state of the data at a given timepoint as a dict

Parameters

timepoint (int) – Integer timepoint to return data as of. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)

Returns

Data at timepoint

Return type

dict

clear() → None. Remove all items from D.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
increment_timepoint()

Increments the timepoint at which the object is currently writing

Returns

None

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair

as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
values() → an object providing a view on D's values
class lrl.data_stores.EpisodeStatistics

Bases: object

Container for statistics about a set of independent episodes through an environment, typically following one policy

Statistics are lazily computed and memoized

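Examples

A minimal sketch of recording a few episodes and pulling statistics (the episode state lists below are arbitrary placeholders):

>>> from lrl.data_stores import EpisodeStatistics
>>> stats = EpisodeStatistics()
>>> stats.add(reward=1.0, episode=[0, 1, 2], terminal=True)
>>> stats.add(reward=3.0, episode=[0, 4, 2], terminal=True)
>>> stats.get_statistic('reward_mean', index=-1)   # -> 2.0, the mean reward over both episodes
>>> stats.get_statistics(index=-1)                 # -> dict of details and statistics for the latest episode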

rewards = None

List of the total reward for each episode (raw data)

Type

list

episodes = None

List of all episodes passed to the data object (raw data)

Type

list

steps = None

List of the total steps taken for each episode (raw data)

Type

list

terminals = None

List of whether each input episode was terminal (raw data)

Type

list

add(reward, episode, terminal)

Add an episode to the data store

Parameters
  • reward (float) – Total reward from the episode

  • episode (list) – List of states encountered in the episode, including the starting and final state

  • terminal (bool) – Boolean indicating whether the episode was terminal (i.e., the environment reported that the episode ended)

Returns

None

get_statistic(statistic='reward_mean', index=-1)

Return a lazily computed and memoized statistic about the rewards from episodes 0 to index

If the statistic has not been previously computed, it will be computed and returned. See .get_statistics() for the list of statistics available

Side Effects:

self.statistics[index] will be computed using self.compute() if it has not been already

Parameters
  • statistic (str) – See .compute() for available statistics

  • index (int) – Episode index for requested statistic

Notes

Statistics are computed for all episodes up to and including the requested statistic. For example if episodes have rewards of [1, 3, 5, 10], get_statistic(‘reward_mean’, index=2) returns 3 (mean of [1, 3, 5]).


Returns

Value of the statistic requested

Return type

int, float
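A short sketch of the Notes example above (EpisodeStatistics and add() as documented; the single-state episodes are placeholders):

>>> from lrl.data_stores import EpisodeStatistics
>>> stats = EpisodeStatistics()
>>> for r in [1, 3, 5, 10]:
...     stats.add(reward=r, episode=[0], terminal=True)
>>> stats.get_statistic('reward_mean', index=2)   # -> 3.0, the mean of [1, 3, 5]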

get_statistics(index=-1)

Return a lazily computed and memoized dictionary of all statistics about episodes 0 to index

If the statistics have not been previously computed, they will be computed here.

Side Effects:

self.statistics[index] will be computed using self.compute() if it has not been already

Parameters

index (int) – Episode index for requested statistic

Returns

Details and statistics about this episode, with keys:

Details about this episode:

  • episode_index (int): Index of episode

  • terminal (bool): Boolean of whether this episode was terminal

  • reward (float): This episode’s reward (included to give easy access to per-episode data)

  • steps (int): This episode’s steps (included to give easy access to per-episode data)

Statistics computed for all episodes up to and including this episode:

  • reward_mean (float): Mean of episode rewards

  • reward_median (float): Median of episode rewards

  • reward_std (float): Standard deviation of episode rewards

  • reward_max (float): Maximum episode reward

  • reward_min (float): Minimum episode reward

  • steps_mean (float): Mean number of steps per episode

  • steps_median (float): Median number of steps per episode

  • steps_std (float): Standard deviation of steps per episode

  • steps_max (float): Maximum number of steps in an episode

  • steps_min (float): Minimum number of steps in an episode

  • terminal_fraction (float): Fraction of episodes that were terminal

Return type

dict

compute(index=-1, force=False)

Compute and store statistics about rewards and steps for episodes up to and including the indexth episode

Side Effects:

self.statistics[index] will be updated

Parameters
  • index (int or 'all') – If an integer, the index of the episode for which statistics are computed. Eg: if index==3, compute the statistics (see get_statistics() for the list) for the series of episodes from 0 up to and including episode 3. If ‘all’, compute statistics for all indices, skipping any that have been previously computed unless force == True

  • force (bool) –

    If True, always recompute statistics even if they already exist.

    If False, only compute if no previous statistics exist.

Returns

None

to_dataframe(include_episodes=False)

Return a Pandas DataFrame of the episode statistics

See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns

Parameters

include_episodes (bool) – If True, add a column containing the entire episode (the list of states) for each row

Returns

Pandas DataFrame

to_csv(filename, **kwargs)

Write statistics to csv via the Pandas DataFrame

See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns

Parameters
  • filename (str) – Filename or full path to output data to

  • kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()

Returns

None
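For example (construction and add() as documented above; extra keyword arguments are assumed to pass straight through to pandas):

>>> from lrl.data_stores import EpisodeStatistics
>>> stats = EpisodeStatistics()
>>> stats.add(reward=1.0, episode=[0, 1], terminal=True)
>>> df = stats.to_dataframe()                            # one row per episode, columns per get_statistics()
>>> stats.to_csv('episode_statistics.csv', index=False)  # index=False is forwarded to DataFrame.to_csv()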

Miscellaneous Utilities

class lrl.utils.misc.Timer

Bases: object

A simple timer class for timing code

start = None

Start time, as returned by timeit.default_timer(), captured at instantiation

elapsed()

Return the time elapsed since this object was instantiated, in seconds

Returns

Time elapsed in seconds

Return type

float
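A small sketch of timing a block of code:

>>> from lrl.utils.misc import Timer
>>> timer = Timer()
>>> _ = sum(range(10 ** 6))   # some work worth timing
>>> timer.elapsed()           # -> seconds since the Timer was instantiated, as a float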

lrl.utils.misc.print_dict_by_row(d, fmt='{key:20s}: {val:d}')

Print a dictionary with a little extra structure, printing a different key/value to each line.

Parameters
  • d (dict) – Dictionary to be printed

  • fmt (str) – Format string to be used for printing. Must contain key and val formatting references

Returns

None
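For example, using the default fmt (which expects integer values) and a custom fmt for floats:

>>> from lrl.utils.misc import print_dict_by_row
>>> print_dict_by_row({'episodes': 100, 'steps': 2500})                  # prints one 'key: value' line per entry
>>> print_dict_by_row({'reward_mean': 2.0}, fmt='{key:20s}: {val:.2f}')  # fmt must reference key and val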

lrl.utils.misc.count_dict_differences(d1, d2, keys=None, raise_on_missing_key=True, print_differences=False)

Return the number of differences between two dictionaries. Useful to compare two policies stored as dictionaries.

Does not properly handle floats that are approximately equal. Mainly intended for ints and objects with a meaningful __eq__

Optionally raise an error on missing keys (otherwise missing keys are counted as differences)

Parameters
  • d1 (dict) – Dictionary to compare

  • d2 (dict) – Dictionary to compare

  • keys (list) – Optional list of keys to consider for differences. If None, all keys will be considered

  • raise_on_missing_key (bool) – If true, raise KeyError on any keys not shared by both dictionaries

  • print_differences (bool) – If true, print all differences to screen

Returns

Number of differences between the two dictionaries

Return type

int
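For example, comparing two policies stored as dicts of state -> action index:

>>> from lrl.utils.misc import count_dict_differences
>>> policy_a = {'s0': 0, 's1': 1, 's2': 2}
>>> policy_b = {'s0': 0, 's1': 3, 's2': 2}
>>> count_dict_differences(policy_a, policy_b)   # -> 1, the policies differ only at 's1'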

lrl.utils.misc.dict_differences(d1, d2)

Return the maximum and mean of the absolute difference between all elements of two dictionaries of numbers

Parameters
  • d1 (dict) – Dictionary to compare

  • d2 (dict) – Dictionary to compare

Returns

tuple containing:

  • float: Maximum elementwise difference

  • float: Sum of elementwise differences

Return type

tuple
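For example, comparing two value functions stored as dicts of state -> value:

>>> from lrl.utils.misc import dict_differences
>>> v1 = {'s0': 1.0, 's1': 2.0}
>>> v2 = {'s0': 1.5, 's1': 2.0}
>>> max_diff, agg_diff = dict_differences(v1, v2)   # max_diff -> 0.5; agg_diff as described in Returns above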

lrl.utils.misc.rc_to_xy(row, col, rows)

Convert from (row, col) coordinates (eg: numpy array) to (x, y) coordinates (bottom left = 0,0)

(x, y) convention

  • (0,0) in bottom left

  • x +ve to the right

  • y +ve up

(row,col) convention:

  • (0,0) in top left

  • row +ve down

  • col +ve to the right

Parameters
  • row (int) – row coordinate to be converted

  • col (int) – col coordinate to be converted

  • rows (int) – Total number of rows

Returns

(x, y)

Return type

tuple
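For example, in a grid with 4 rows, the conventions above imply that the top-left cell maps to x=0, y=3 and the bottom-left cell maps to x=0, y=0:

>>> from lrl.utils.misc import rc_to_xy
>>> rc_to_xy(row=0, col=0, rows=4)   # -> (0, 3)
>>> rc_to_xy(row=3, col=0, rows=4)   # -> (0, 0)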

lrl.utils.misc.params_to_name(params, n_chars=4, sep='_', first_fields=None, key_remap=None)

Convert a mapping (eg: dict) of parameters into a string for easy test naming

Warning

Currently includes hard-coded formatting that interprets keys named ‘alpha’ or ‘epsilon’

Parameters
  • params (dict) – Dictionary to convert to a string

  • n_chars (int) – Number of characters per key to add to string. Eg: if key=’abcdefg’ and n_chars=4, output will be ‘abcd’

  • sep (str) – Separator character between fields (one separator is used between a key and its value, and two between different key-value pairs)

  • first_fields (list) – Optional list of keys to write ahead of other keys (otherwise, output order is sorted)

  • key_remap (list) – List of dictionaries of {key_name: new_key_name} for rewriting keys into more readable strings

Returns

String representation of params, suitable for use as a name

Return type

str
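A hedged sketch of typical usage (the exact string returned depends on the hard-coded formatting noted in the Warning above):

>>> from lrl.utils.misc import params_to_name
>>> params = {'alpha': 0.1, 'epsilon': 0.9, 'gamma': 0.95}
>>> params_to_name(params, n_chars=4, sep='_')   # -> a compact name along the lines of 'alph_0.1__epsi_0.9__gamm_0.95'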