API¶
Solvers¶
-
class
lrl.solvers.
PolicyIteration
(env, value_function_initial_value=0.0, max_policy_eval_iters_per_improvement=10, policy_evaluation_type='on-policy-iterative', **kwargs)¶ Bases:
lrl.solvers.base_solver.BaseSolver
Solver for policy iteration
Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 80).
Notes
See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
Examples
See examples directory
- Parameters
value_function_initial_value (float) – Value to initialize all elements of the value function to
max_policy_eval_iters_per_improvement (int) – Maximum number of policy evaluation iterations performed per policy improvement step
policy_evaluation_type (str) – Type of solution method for calculating the policy (see policy_evaluation() for more details). Typical usage should not need to change this, as alternatives make calculations slower and more memory intensive
See the BaseSolver class for additional parameters
- Returns
None
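A minimal usage sketch follows (hedged: it assumes Racetrack() with default arguments yields a usable environment, and that extra keyword arguments such as gamma are forwarded to BaseSolver via **kwargs):
from lrl.environments import Racetrack
from lrl.solvers import PolicyIteration
env = Racetrack()                        # any instanced lrl environment (see Environments below)
solver = PolicyIteration(env, gamma=0.9)  # gamma assumed forwarded to BaseSolver
solver.iterate_to_convergence()           # alternate policy evaluation/improvement until the policy stops changing
print(solver.converged(), solver.iteration)
stats = solver.score_policy(iters=100)    # statistics from 100 greedy episodes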
-
value
= None¶ Space-efficient dict-like storage of the current and all former value functions
- Type
DictWithHistory
-
iterate
()¶ Perform a single iteration of policy iteration, updating self.value and storing metadata about the iteration.
Side Effects:
self.value: Updated to the newest estimate of the value function
self.policy: Updated to the greedy policy according to the value function estimate
self.iteration: Increment iteration counter by 1
self.iteration_data: Add new record to iteration data store
- Returns
None
-
converged
()¶ Returns True if solver is converged.
Judge convergence by checking whether the most recent policy iteration resulted in any changes in policy
- Returns
Convergence status (True=converged)
- Return type
bool
-
_policy_evaluation
(max_iters=None)¶ Compute an estimate of the value function for the current policy to within self.tolerance
- Side Effects:
self.value: Updated to the newest estimate of the value function
- Returns
None
-
_policy_improvement
(return_differences=True)¶ Update the policy to be greedy relative to the most recent value function
- Side Effects:
self.policy: Updated to be greedy relative to self.value
- Parameters
return_differences – If True, return number of differences between old and new policies
- Returns
(if return_differences==True) Number of differences between the old and new policies
- Return type
int
-
init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
- Parameters
init_type (None, str) –
Method used for initializing the policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
random: Initialize policy to a random action (action indices are random integers from [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
- Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
- Returns
None
-
iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
- Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
- Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
- Returns
None
-
run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
- Side Effects:
self.env will be reset and optionally then forced into initial_state
- Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
tuple containing:
states (list): list of states visited during the episode (including the starting state)
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
- Return type
(tuple)
-
score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
- Side Effects:
self.env will be reset
- Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
- Return type
EpisodeStatistics
-
class
lrl.solvers.
ValueIteration
(env, value_function_initial_value=0.0, **kwargs)¶ Bases:
lrl.solvers.base_solver.BaseSolver
Solver for value iteration
Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 82).
Notes
See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
Examples
See examples directory
- Parameters
value_function_initial_value (float) – Value to initialize all elements of the value function to
BaseSolver class for additional (See) –
- Returns
None
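The usage pattern mirrors PolicyIteration; a brief sketch under the same assumptions as the PolicyIteration example above:
from lrl.environments import Racetrack
from lrl.solvers import ValueIteration
env = Racetrack()
solver = ValueIteration(env, value_function_initial_value=0.0)
solver.iterate_to_convergence()
# solver.value behaves like a dict keyed by state; past iterations are retained (see DictWithHistory below)
final_value_function = solver.value.to_dict()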
-
value
= None¶ Space-efficient dict-like storage of the current and all former value functions
- Type
DictWithHistory
-
iterate
()¶ Perform a single iteration of value iteration, updating self.value and storing metadata about the iteration.
Side Effects:
self.value: Updated to the newest estimate of the value function
self.policy: Updated to the greedy policy according to the value function estimate
self.iteration: Increment iteration counter by 1
self.iteration_data: Add new record to iteration data store
- Returns
None
-
converged
()¶ Returns True if solver is converged.
Test convergence by comparing the latest value function delta_max to the convergence tolerance
- Returns
Convergence status (True=converged)
- Return type
bool
-
init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
- Parameters
init_type (None, str) –
Method used for initializing the policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
random: Initialize policy to a random action (action indices are random integers from [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
- Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
- Returns
None
-
iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
- Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
- Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
- Returns
None
-
run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
- Side Effects:
self.env will be reset and optionally then forced into initial_state
- Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
tuple containing:
states (list): list of states visited during the episode (including the starting state)
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
- Return type
(tuple)
-
score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
- Side Effects:
self.env will be reset
- Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
- Return type
EpisodeStatistics
-
class
lrl.solvers.
QLearning
(env, value_function_tolerance=0.1, alpha=None, epsilon=None, max_iters=2000, min_iters=250, num_episodes_for_convergence=20, **kwargs)¶ Bases:
lrl.solvers.base_solver.BaseSolver
Solver class for Q-Learning
Notes
See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
Examples
See examples directory
- Parameters
alpha (float, dict) –
(OPTIONAL)
If None, default linear decay schedule applied, decaying from 0.1 at iter 0 to 0.025 at max iter
If float, interpreted as a constant alpha value
If dict, interpreted as specifications to a decay function as defined in decay_functions()
epsilon (float, dict) –
(OPTIONAL)
If None, default linear decay schedule applied, decaying from 0.25 at iter 0 to 0.05 at max iter
If float, interpreted as a constant epsilon value
If dict, interpreted as specifications to a decay function as defined in decay_functions()
num_episodes_for_convergence (int) – Number of consecutive episodes with delta_Q < tolerance to say a solution is converged
**kwargs – Other arguments passed to BaseSolver
- Returns
None
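A brief sketch (assumptions: Racetrack() defaults are usable, and constant float values for alpha and epsilon are acceptable as described above):
from lrl.environments import Racetrack
from lrl.solvers import QLearning
env = Racetrack()
solver = QLearning(env, alpha=0.1, epsilon=0.1, max_iters=2000)
solver.iterate_to_convergence()           # each iteration is one learning episode
print(solver.transitions)                 # total transitions experienced during learning
greedy_stats = solver.score_policy(iters=100)  # evaluate the learned policy greedily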
-
transitions
= None¶ Counter for number of transitions experienced during all learning
- Type
int
-
q
= None¶ Space-efficient dict-like storage of the current and all former q functions
- Type
DictWithHistory
-
iteration_data
= None¶ Data store for iteration data
Overloads BaseSolver’s iteration_data attribute with one that includes more fields
- Type
-
episode_statistics
= None¶ Data store for statistics from training episodes
- Type
-
num_episodes_for_convergence
= None¶ Number of consecutive episodes with delta_Q < tolerance to say a solution is converged
- Type
int
-
_policy_improvement
(states=None)¶ Update the policy to be greedy relative to the most recent q function
- Side Effects:
self.policy: Updated to be greedy relative to self.q
- Parameters
states – List of states to update. If None, all states will be updated
- Returns
None
-
step
(count_transition=True)¶ Take and learn from a single step in the environment.
Applies the typical Q-Learning approach to learn from the experienced transition
- Parameters
count_transition (bool) – If True, increment transitions counter self.transitions. Else, do not.
- Returns
tuple containing:
transition (tuple): Tuple of (state, reward, next_state, is_terminal)
delta_q (float): The (absolute) change in q caused by this step
- Return type
(tuple)
-
iterate
()¶ Perform and learn from a single episode in the environment (one walk from start to finish)
Side Effects:
self.value: Updated to the newest estimate of the value function
self.policy: Updated to the greedy policy according to the value function estimate
self.iteration: Increment iteration counter by 1
self.iteration_data: Add new record to iteration data store
self.env: Reset and then walked through
- Returns
None
-
choose_epsilon_greedy_action
(state, epsilon=None)¶ Return an action chosen by epsilon-greedy scheme based on the current estimate of Q
- Parameters
state (int, tuple) – Descriptor of current state in environment
epsilon – Optional. If None, self.epsilon is used
- Returns
action chosen
- Return type
int or tuple
-
converged
()¶ Returns True if solver is converged.
- Returns
Convergence status (True=converged)
- Return type
bool
-
get_q_at_state
(state)¶ Returns a numpy array of q values at the current state, in the same order as the standard action indexing
- Parameters
state (int, tuple) – Descriptor of current state in environment
- Returns
Numpy array of q for all actions
- Return type
np.array
-
init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
- Parameters
init_type (None, str) –
Method used for initializing the policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
random: Initialize policy to a random action (action indices are random integers from [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
- Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
- Returns
None
-
init_q
(init_val=0.0)¶ Initialize self.q, a dict-like DictWithHistory object for storing the state-action value function q
- Parameters
init_val (float) – Value to give all states in the initialized q
- Returns
None
-
iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
- Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
- Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
- Returns
None
-
run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
- Side Effects:
self.env will be reset and optionally then forced into initial_state
- Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
tuple containing:
states (list): list of states visited during the episode (including the starting state)
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
- Return type
(tuple)
-
score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
- Side Effects:
self.env will be reset
- Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
- Return type
EpisodeStatistics
-
property
alpha
¶ Returns value of alpha at current iteration
-
property
epsilon
¶ Returns value of epsilon at current iteration
-
class
lrl.solvers.
BaseSolver
(env, gamma=0.9, value_function_tolerance=0.001, policy_init_mode='zeros', max_iters=500, min_iters=2, max_steps_per_episode=100, score_while_training=False, raise_if_not_converged=False)¶ Bases:
object
Base class for solvers
Examples
See examples directory
- Parameters
env – Environment instance, such as from RaceTrack() or RewardingFrozenLake()
gamma (float) – Discount factor
value_function_tolerance (float) – Tolerance for convergence of the value function during solving (also used as the tolerance for the Q (state-action) value function)
policy_init_mode (str) – Initialization mode for policy. See init_policy() for more detail
max_iters (int) – Maximum number of iterations to solve environment
min_iters (int) – Minimum number of iterations before checking for solver convergence
raise_if_not_converged (bool) – If True, will raise exception when environment hits max_iters without convergence. If False, a warning will be logged.
max_steps_per_episode (int) – Maximum number of steps allowed per episode (helps when evaluating policies that can lead to infinite walks)
score_while_training (dict, bool) –
Dict or bool specifying whether the policy should be scored during training (eg: test how well a policy is doing every N iterations).
If dict, must be of format:
n_trains_per_eval (int): Number of training iters between evaluations
n_evals (int): Number of episodes for a given policy evaluation
If True, score with default settings of:
n_trains_per_eval: 500
n_evals: 500
If False, do not score during training.
- Returns
None
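The score_while_training argument described above can be illustrated with a short sketch (hedged: the dict keys are taken from the description above, and the concrete solver shown is arbitrary):
from lrl.environments import Racetrack
from lrl.solvers import ValueIteration
# Score the policy every 50 training iterations, using 100 episodes per evaluation
score_spec = {'n_trains_per_eval': 50, 'n_evals': 100}
solver = ValueIteration(Racetrack(), score_while_training=score_spec)
solver.iterate_to_convergence()
# solver.scoring_summary then holds reward_mean for each scoring run (see the scoring_summary attribute below)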
-
policy
= None¶ Space-efficient dict-like storage of the current and all former policies.
- Type
DictWithHistory
-
iteration_data
= None¶ Data describing iteration results during solving of the environment.
Fields include:
time: time for this iteration
delta_max: maximum change in value function for this iteration
policy_changes: number of policy changes this iteration
converged: boolean denoting if solution is converged after this iteration
- Type
-
scoring_summary
= None¶ Summary data from scoring runs computed during training if score_while_training == True
Fields include:
reward_mean: mean reward obtained during a given scoring run
- Type
-
scoring_episode_statistics
= None¶ Detailed scoring data from scoring runs held as a dict of EpisodeStatistics objects.
Data is indexed by iteration number (from scoring_summary)
- Type
dict, EpisodeStatistics
-
init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
- Parameters
init_type (None, str) –
Method used for initializing the policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
random: Initialize policy to a random action (action indices are random integers from [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
- Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
- Returns
None
-
iterate
()¶ Perform a single iteration of the solver.
This may be an iteration through all states in the environment (like in policy iteration) or obtaining and learning from a single experience (like in Q-Learning)
This method should update self.value and may update self.policy, and also commit iteration statistics to self.iteration_data. Unless the subclass implements a custom self.converged, self.iteration_data should include a boolean entry for “converged”, which is used by the default converged() function.
- Returns
None
-
iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
- Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
- Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
- Returns
None
-
converged
()¶ Returns True if solver is converged.
This may be custom for each solver, but as a default it checks whether the most recent iteration_data entry has converged==True
- Returns
Convergence status (True=converged)
- Return type
bool
-
run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
- Side Effects:
self.env will be reset and optionally then forced into initial_state
- Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
tuple containing:
states (list): list of states visited during the episode (including the starting state)
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
- Return type
(tuple)
-
score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
- Side Effects:
self.env will be reset
- Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
- Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
- Return type
EpisodeStatistics
-
env
= None¶ Environment being solved
- Type
Racetrack, RewardingFrozenLakeEnv
Environments¶
-
class
lrl.environments.
Racetrack
(track=None, x_vel_limits=None, y_vel_limits=None, x_accel_limits=None, y_accel_limits=None, max_total_accel=2)¶ Bases:
gym.envs.toy_text.discrete.DiscreteEnv
A car-race-like environment that uses location and velocity for state and acceleration for actions, in 2D
Loosely inspired by the Racetrack example of Sutton and Barto’s Reinforcement Learning (Exercise 5.8, http://www.incompleteideas.net/book/the-book.html)
The objective of this environment is to traverse a racetrack from a start location to any goal location. Reaching a goal location returns a large reward and terminates the episode, whereas landing on a grass location returns a large negative reward and terminates the episode. All non-terminal transitions return a small negative reward. Oily road surfaces are non-terminal but also react to an agent’s action stochastically, sometimes causing an Agent to “slip” whereby their requested action is ignored (interpreted as if a=(0,0)).
The tiles in the environment are:
(blank): Clean open (deterministic) road
O: Oily (stochastic) road
G: (terminal) grass
S: Starting location (agent starts at a random starting location). After starting, S tiles behave like open road
F: Finish location(s) (agent must reach any of these tiles to receive the positive reward)
The state space of the environment is described by xy location and xy velocity (with maximum velocity being a user-specified parameter). For example, s=(3, 5, 1, -1) means the Agent is currently in the x=3, y=5 location with Vx=1, Vy=-1.
The action space of the environment is xy acceleration (with maximum acceleration being a user-specified parameter). For example, a=(-2, 1) means ax=-2, ay=1. Transitions are determined by the current velocity as well as the requested acceleration (with a cap set by Vmax of the environment), for example:
s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(1, 5, -2, 0)
But if vx_max == +-1 then:
s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(2, 5, -1, 0)
Note that sign conventions for location are:
x: 0 at leftmost column, positive to the right
y: 0 at bottommost row, positive up
- Parameters
track (list) – List of strings describing the track (see racetrack_tracks.py for examples)
x_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in x. Default is (-2, 2).
y_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in y. Default is (-2, 2).
x_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in x. Default is (-2, 2).
y_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in y. Default is (-2, 2).
max_total_accel (int) – (OPTIONAL) Integer maximum total acceleration in one action. Total acceleration is computed by abs(x_a)+abs(y_a), representing the sum of the magnitude of acceleration in x and y. Default is infinite (eg: any accel described by the x and y limits)
Notes
See also discrete.DiscreteEnv for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
DOCTODO: Add examples
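Pending that, a hedged construction sketch (the track layout below is hypothetical but follows the tile legend above; whether these exact arguments produce a solvable track has not been verified):
from lrl.environments import Racetrack
# Hypothetical track: grass border, start on the left, oily patch mid-track, finish on the right
track = ['GGGGGGG',
         'G     G',
         'GS O FG',
         'G     G',
         'GGGGGGG']
env = Racetrack(track=track, max_total_accel=2)
env.reset()    # place the agent at a random starting location
env.render()   # print the track and current agent location to the screen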
-
track
= None¶ List of strings describing track or the string name of a default track
- Type
list, str
-
desc
= None¶ Numpy character array of the track (better for printing on screen/accessing track at xy locations)
- Type
np.array
-
color_map
= None¶ Map from grid tile type to display color
- Type
dict
-
index_to_state
= None¶ Attribute to map from state index to full tuple describing state
Ex: index_to_state[state_index] -> state_tuple
- Type
list
-
state_to_index
= None¶ Attribute to map from state tuple to state index
Ex: state_to_index[state_tuple] -> state_index
- Type
dict
-
is_location_terminal
= None¶ Attribute to map whether a state is terminal (eg: no rewards/transitions leading out of the state).
Keyed by state tuple
- Type
dict
-
s
= None¶ Current state (inherited from parent)
- Type
int, tuple
-
reset
()¶ Reset the environment to a random starting location
- Returns
None
-
render
(mode='human', current_location='*')¶ Render the environment.
Warning
This method does not follow the prototype of its parent. It is presently a very simple version for printing the environment’s current state to the screen
- Parameters
mode – (NOT USED)
current_location – Character to denote the current location
- Returns
None
-
step
(a)¶ Take a step in the environment.
This wraps the parent object’s step(), mapping between integer actions and human-readable action tuples
- Parameters
a (tuple, int) – Action to take, either as an integer (0..nA-1) or true action (tuple of (x_accel,y_accel))
- Returns
Next state, either as a tuple or int depending on type of state used
-
close
()¶ Override _close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
-
seed
(seed=None)¶ Sets the seed for this env’s random number generator(s).
Note
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns
The list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
- Return type
list<bigint>
-
property
unwrapped
¶ Completely unwrap this env.
- Returns
The base non-wrapped gym.Env instance
- Return type
gym.Env
Experiment Runners¶
-
lrl.utils.experiment_runners.
run_experiment
(env, params, output_path)¶ Run a single experiment (env/solver combination), outputting results to a given location
- FUTURE: Improve easy reproducibility by outputting a settings file or similar? Could use gin-config or just output
params. Outputting params doesn’t cover env though…
- Parameters
env – An instanced environment object (eg: Racetrack() or RewardingFrozenLake())
params – A dictionary of solver parameters for this run
output_path (str) – Path to output data (plots and csvs)
Output to output_path:
iteration_data.csv: Data about each solver iteration (shows how long each iteration took, how quickly the solver converged, etc.)
solver_results*.png: Images of policy (and value for planners). If environment state is defined by xy alone, a single image is returned. Else, an image for each additional state is returned (eg: for state = (x, y, vx, vy), plots of solver_results_vx_vy.png are returned for each (vx, vy))
scored_episodes.csv and scored_episodes.png: Detailed data for each episode taken during the final scoring, and a composite image of those episodes in the environment
intermediate_scoring_results.csv: Summary data from each evaluation during training (shows history of how the solver improved over time)
intermediate_scoring_results_*.png: Composite images of the intermediate scoring results taken during training, indexed by the iteration at which they were produced
training_episodes.csv and training_episodes.png: Detailed data for each episode taken during training, and a composite image of those episodes exploring the environment (only available for exploration-based learners like Q-Learning)
- Returns
dict containing:
solver (BaseSolver, ValueIteration, PolicyIteration, QLearner): Fully populated solver object (after solving env)
scored_results (EpisodeStatistics): EpisodeStatistics object of results from scoring the final policy
solve_time (float): Time in seconds used to solve the env (eg: run solver.iterate_to_convergence())
- Return type
(dict)
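A hedged sketch of a single run (the keys shown in params are hypothetical solver settings; the exact schema is defined by the solver being configured and is not documented here):
from lrl.environments import Racetrack
from lrl.utils.experiment_runners import run_experiment
results = run_experiment(env=Racetrack(),
                         params={'gamma': 0.9, 'max_iters': 500},   # hypothetical solver parameters
                         output_path='./output/racetrack_run/')
print(results['solve_time'])   # keys per the Returns description above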
-
lrl.utils.experiment_runners.
run_experiments
(environments, solver_param_grid, output_path='./output/')¶ Runs a set of experiments defined by param_grid, writing results to output_path
- Parameters
environments (list) – List of instanced environments
solver_param_grid (dict) – Solver parameters in suitable form for sklearn.model_selection.ParameterGrid
output_path (str) – Relative path to which results will be output
Output to output_path:
For each environment:
env_name/grid_search_summary.csv: high-level summary of results for this env
env_name/case_name: Directory with detailed results for each env/case combination (see run_experiment for details on casewise output)
- Returns
None
Plotting¶
-
lrl.utils.plotting.
plot_solver_convergence
(solver, **kwargs)¶ Convenience binding to plot convergence statistics for a solver object.
Also useful as a recipe for custom plotting.
- Parameters
solver (BaseSolver (or child)) – Solver object to be plotted
**kwargs – See plot_solver_convergence_from_df()
- Returns
Matplotlib axes object
- Return type
Axes
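For example (hedged: assumes Racetrack() defaults are usable and that keyword arguments such as y are forwarded to plot_solver_convergence_from_df):
from lrl.environments import Racetrack
from lrl.solvers import ValueIteration
from lrl.utils.plotting import plot_solver_convergence
solver = ValueIteration(Racetrack())
solver.iterate_to_convergence()
ax = plot_solver_convergence(solver, y='delta_max')
ax.figure.savefig('convergence.png')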
-
lrl.utils.plotting.
plot_solver_convergence_from_df
(df, y='delta_max', y_label=None, x='iteration', x_label='Iteration', label=None, ax=None, savefig=None, **kwargs)¶ Convenience binding to plot convergence statistics for a set of solver objects.
Also useful as a recipe for custom plotting.
- Parameters
df (pandas.DataFrame) – DataFrame with solver convergence data
y (str) – Convergence statistic to be plotted (eg: delta_max, delta_mean, time, or policy_changes)
y_label (str) – Optional label for y_axis (if omitted, will use y as default name unless axis is already labeled)
x (str) – X axis data (typically ‘iteration’, but could be any convergence data)
x_label (str) – Optional label for x_axis (if omitted, will use ‘Iteration’)
label (str) – Optional label for the data set (shows up in axes legend)
ax (Axes) – Optional Matplotlib Axes object to add this line to
savefig (str) – Optional filename to save the figure to
kwargs – Additional args passed to matplotlib’s plot
- Returns
Matplotlib axes object
- Return type
Axes
-
lrl.utils.plotting.
plot_env
(env, ax=None, edgecolor='k', resize_figure=True, savefig=None)¶ Plot the map of an environment
- Parameters
env – Environment to plot
ax (axes) – (Optional) Axes object to plot on
edgecolor (str) – Color of the edge of each grid square (matplotlib format)
resize_figure (bool) –
If true, resize the figure to:
width = 0.5 * n_cols inches
height = 0.5 * n_rows inches
savefig (str) – If not None, save the figure to this filename
- Returns
Matplotlib axes object
- Return type
Axes
-
lrl.utils.plotting.
plot_solver_results
(env, solver=None, policy=None, value=None, savefig=None, **kwargs)¶ Convenience function to plot results from a solver over the environment map
Input can be using a BaseSolver or child object, or by specifying policy and/or value directly via dict or DictWithHistory.
See plot_solver_result() for more info on generation of individual plots and additional arguments for color/precision.
- Parameters
env – Augmented OpenAI Gym-like environment object
solver (BaseSolver) – Solver object used to solve the environment
policy (dict, DictWithHistory) – Policy for the environment, keyed by integer state-index or tuples of state
value (dict, DictWithHistory) – Value function for the environment, keyed by integer state-index or tuples of state
savefig (str) – If not None, save figures to this name. For cases with multiple policies per grid square, this will be the suffix on the name (eg: for policy at Vx=1, Vy=2, we get name of savefig_1_2.png)
**kwargs (dict) – Other arguments passed to plot_solver_result
- Returns
list of Matplotlib Axes for the plots
- Return type
list
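A sketch of typical use (hedged: assumes a solver that has been run to convergence as in the solver examples above; savefig naming follows the description above):
from lrl.environments import Racetrack
from lrl.solvers import PolicyIteration
from lrl.utils.plotting import plot_solver_results
env = Racetrack()
solver = PolicyIteration(env)
solver.iterate_to_convergence()
# For Racetrack, one figure per (vx, vy) combination, saved as solver_results_<vx>_<vy>.png
axes = plot_solver_results(env, solver=solver, savefig='solver_results')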
-
lrl.utils.plotting.
plot_policy
(env, policy, **kwargs)¶ Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail
-
lrl.utils.plotting.
plot_value
(env, value, **kwargs)¶ Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail
-
lrl.utils.plotting.
plot_solver_result
(env, policy=None, value=None, ax=None, add_env_to_plot=True, hide_terminal_locations=True, color='k', title=None, savefig=None, size_policy='auto', size_value='auto', value_precision=2)¶ Plot result for a single xy map using a numpy array of shaped policy and/or value
- Parameters
env (Racetrack, FrozenLake, other environment) – Instantiated environment object
policy (np.array) – Policy for each grid square in the environment, in the same shape as env.desc. For plotting environments where we have multiple states for a given grid square (eg for Racetrack), will call plotting for each given additional state (eg: for v=(0, 0), v=(1, 0), ..)
value (np.array) – Value for each grid square in the environment, in the same shape as env.desc. For plotting environments where we have multiple states for a given grid square (eg for Racetrack), will call plotting for each given additional state (eg: for v=(0, 0), v=(1, 0), ..)
ax (Axes) – (OPTIONAL) Matplotlib axes object to plot to
add_env_to_plot (bool) – If True, add the environment map to the axes before plotting policy using plot_env()
hide_terminal_locations (bool) – If True, all known terminal locations will have no text printed (as policy here doesn’t matter)
color (str) – Matplotlib color string denoting color of the text for policy/value
title (str) – (Optional) title added to the axes object
savefig (str) – (Optional) string filename to output the figure to
size_policy (str, numeric) –
(Optional) Specification of text font size for policy printing. One of:
’auto’: Will automatically choose a font size based on the number of characters to be printed
str or numeric: Interpreted as a Matplotlib style font size designation
size_value (str, numeric) – (Optional) Specification of text font size for value printing. Same interface as size_policy
value_precision (int) – Precision of value function to be included on figures
- Returns
Matplotlib Axes object
-
lrl.utils.plotting.
plot_episodes
(episodes, env=None, add_env_to_plot=True, max_episodes=100, alpha=None, color='k', title=None, ax=None, savefig=None)¶ Plot a list of episodes through an environment over a drawing of the environment
- Parameters
episodes (list, EpisodeStatistics) – Series of episodes to be plotted. If EpisodeStatistics instance, .episodes will be extracted
env – Environment traversed
add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image
alpha (float) – (Optional) alpha (transparency) used for plotting the episode. If left as None, a value will be chosen based on the number of episodes to be plotted
color (str) – Matplotlib-style color designation
title (str) – (Optional) Title to be added to the axes
ax (axes) – (Optional) Matplotlib axes object to write the plot to
savefig (str) – (Optional) string filename to output the figure to
max_episodes (int) – Maximum number of episodes to add to the plot. If len(episodes) exceeds this value, randomly chosen episodes will be used
- Returns
Matplotlib Axes object with episodes plotted to it
-
lrl.utils.plotting.
plot_episode
(episode, env=None, add_env_to_plot=True, alpha=None, color='k', title=None, ax=None, savefig=None)¶ Plot a single episode (walk) through the environment
- Parameters
episode (list) – List of states encountered in the episode
env – Environment traversed
add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image
alpha (float) – (Optional) alpha (transparency) used for plotting the episode.
color (str) – Matplotlib-style color designation
title (str) – (Optional) Title to be added to the axes
ax (axes) – (Optional) Matplotlib axes object to write the plot to
savefig (str) – (Optional) string filename to output the figure to
- Returns
Matplotlib Axes object with a single episode plotted to it
-
lrl.utils.plotting.
choose_text_size
(n_chars, boxsize=1.0)¶ Helper to choose an appropriate text size when plotting policies. Size is chosen based on length of text
Return is calibrated to something that typically looked nice in testing
- Parameters
n_chars (int) – Number of characters in the text caption to be added to the plot
boxsize (float) – Size of box inside which text should print nicely. Used as a scaling factor. Default is 1 inch
- Returns
Matplotlib-style text size argument
-
lrl.utils.plotting.
policy_dict_to_array
(env, policy_dict)¶ Convert a policy stored as a dictionary into a list of one or more policy numpy arrays shaped like env.desc
Can also be used for a value_dict.
policy_dict is a dictionary relating state to policy at that state in one of several forms. The dictionary can be keyed by state-index or a tuple of state (eg: (x, y, [other_state]), with x=0 in left column, y=0 in bottom row). If using tuples of state, state may be more than just x,y location as shown above, eg: (x, y, v_x, v_y). If len(state_tuple) > 2, we must plot each additional state separately.
Translate policy_dict into a policy_list_of_tuples of:
[(other_state_0, array_of_policy_at_other_state_0), (other_state_1, array_of_policy_at_other_state_1), ... ]
where the array_of_policy_at_other_state_* is in the same shape as env.desc (eg: cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).
Examples
If state is described by tuples of (x, y) (where there is a single unique state for each grid location), eg:
policy_dict = { (0, 0): policy_0_0, (0, 1): policy_0_1, (0, 2): policy_0_2, ... (1, 0): policy_1_0, (1, 1): policy_1_1, ... (xmax, ymax): policy_xmax_ymax, }
then a single-element list is returned of the form:
returned = [ (None, np_array_of_policy), ]
where np_array_of_policy is of the same shape as env.desc (eg: the map), with each element corresponding to the policy at that grid location (for example, cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).
If state is described by tuples of (x, y, something_else, [more_something_else…]), for example if state = (x, y, Vx, Vy) like below:
policy_dict = { (0, 0, 0, 0): policy_0_0_0_0, (0, 0, 1, 0): policy_0_0_1_0, (0, 0, 0, 1): policy_0_0_0_1, ... (1, 0, 0, 0): policy_1_0_0_0, (1, 0, 0, 1): policy_1_0_0_1, ... (xmax, ymax, Vxmax, Vymax): policy_xmax_ymax_Vxmax_Vymax, }
then a list is returned of the form:
returned = [ # (other_state, np_array_of_policies_for_this_other_state) ((0, 0), np_array_of_policies_with_Vx-0_Vy-0), ((1, 0), np_array_of_policies_with_Vx-1_Vy-0), ((0, 1), np_array_of_policies_with_Vx-0_Vy-1), ... ((Vxmax, Vymax), np_array_of_policies_with_Vxmax_Vymax), ]
where each element corresponds to a different combination of all the non-location state. This means that each element of the list is:
(Identification_of_this_case, shaped_xy-grid_of_policies_for_this_case)
and can be easily plotted over the environment’s map.
If policy_dict is keyed by state-index rather than state directly, the same logic as above still applies.
Notes
If using an environment (with policy keyed by either index or state) that has more than one unique state per grid location (eg: state has more than (x, y)), then environment must also have an index_to_state attribute to identify overlapping states. This constraint exists both for policies keyed by index or state, but the code could be refactored to avoid this limitation for state-keyed policies if required.
- Parameters
env – Augmented OpenAI Gym-like environment object
policy_dict (dict) – Dictionary of policy for the environment, keyed by integer state-index or tuples of state
- Returns
list of (description, shaped_policy) elements as described above
-
lrl.utils.plotting.
get_ax
(ax)¶ Returns figure and axes objects associated with an axes, instantiating if input is None
Data Stores¶
-
class
lrl.data_stores.
GeneralIterationData
(columns=None)¶ Bases:
object
Class to store data about solver iterations
Data is stored as a list of dictionaries. This is a placeholder for more advanced storage. Class gives a minimal set of extra bindings for convenience.
The present object has no checks to ensure consistency between added records (all have same fields, etc.). If any columns are missing from an added record, outputting to a dataframe will result in Pandas treating these as missing values for that record.
- Parameters
columns (list) – An optional list of column names for the data (if specified, this sets the order of the columns in any output Pandas DataFrame or csv)
DOCTODO: Add example of usage
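Pending that, a brief usage sketch (the field names shown are illustrative):
from lrl.data_stores import GeneralIterationData
data = GeneralIterationData(columns=['iteration', 'delta_max'])
data.add({'iteration': 0, 'delta_max': 0.5})
data.add({'iteration': 1, 'delta_max': 0.1})
print(data.get(-1))               # most recently added record
df = data.to_dataframe()          # columns ordered as specified above
data.to_csv('iteration_data.csv')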
-
columns
= None¶ Column names used for data output.
If specified, this sets the order of any columns being output to Pandas DataFrame or csv
- Type
list
-
data
= None¶ List of dictionaries representing records.
Intended to be internal in future, but public at present to give easy access to records for slicing
- Type
list
-
add
(d)¶ Add a dictionary record to the data structure.
- Parameters
d (dict) – Dictionary of data to be stored
- Returns
None
-
get
(i=-1)¶ Return the ith entry in the data store (index of storage is in order in which data is committed to this object)
- Parameters
i (int) – Index of data to return (can be any valid list index, including -1 and slices)
- Returns
ith entry in the data store
- Return type
dict
-
to_dataframe
()¶ Returns the data structure as a Pandas DataFrame
- Returns
Pandas DataFrame of the data
- Return type
dataframe
-
to_csv
(filename, **kwargs)¶ Write data structure to a csv via the Pandas DataFrame
- Parameters
filename (str) – Filename or full path to output data to
kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()
- Returns
None
-
class
lrl.data_stores.
DictWithHistory
(timepoint_mode='explicit', tolerance=1e-07)¶ Bases:
collections.abc.MutableMapping
Dictionary-like object that maintains a history of all changes, either incrementally or at set timepoints
This object has access like a dictionary, but stores data internally such that the user can later recreate the state of the data from a past timepoint.
The intended use of this object is to store large objects which are iterated on (such as value or policy functions) in a way that a history of changes can be reproduced without having to store a new copy of the object every time. For example, when doing 10000 episodes of Q-Learning in a grid world with 2500 states, we can retain the full policy history during convergence (eg: answer “what was my policy after episode 527”) without keeping 10000 copies of a nearly-identical 2500 element numpy array or dict. The cost for this is some computation, although this generally has not been seen to be too significant (~10’s of seconds for a large Q-Learning problem in testing)
- Parameters
timepoint_mode (str) – One of:
‘explicit’: Timepoint incrementing is handled explicitly by the user (the timepoint only changes if the user invokes .increment_timepoint())
‘implicit’: Timepoint incrementing is automatic and occurs on every set action, including redundant sets (setting a key to a value it already holds). This is useful for keeping a time history of all sets to the object
tolerance (float) – Absolute tolerance to test for when replacing values. If a value to be set is less than tolerance different from the current value, the current value is not changed.
Warning
Deletion of keys is not specifically supported. Deletion likely works for the most recent timepoint, but the history does not handle deleted keys properly
Numeric data may work best due to how new values are compared to existing data, although tuples have also been tested. See __setitem__ for more detail
DOCTODO: Add example
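Pending that, a brief usage sketch (the keys shown are illustrative; expected outputs in comments follow the method docs below):
from lrl.data_stores import DictWithHistory
d = DictWithHistory(timepoint_mode='explicit')
d[(0, 0)] = 1.0                      # stored at timepoint 0
d.increment_timepoint()
d[(0, 0)] = 2.0                      # stored at timepoint 1
print(d[(0, 0)])                     # 2.0 (most recent value)
print(d.get_value_history((0, 0)))   # expected [(0, 1.0), (1, 2.0)]
print(d.to_dict(timepoint=0))        # state of the data as of timepoint 0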
-
timepoint_mode
= None¶ See Parameters for definition
- Type
str
-
current_timepoint
= None¶ Timepoint that will be written to next
- Type
int
-
__getitem__
(key)¶ Return the most recent value for key
- Returns
Whatever is contained in ._data[key][-1][-1] (return only the value from the most recent timepoint, not the timepoint associated with it)
-
__setitem__
(key, value)¶ Set the value at a key if it is different from the current data stored at key
Data stored here is stored under the self.current_timepoint.
Difference between new and current values is assessed by testing:
new_value == old_value
np.isclose(new_value, old_value)
where if neither returns True, the new value is taken to be different from the current value
- Side Effects:
If timepoint_mode == ‘implicit’, self.current_timepoint will be incremented after setting data
- Parameters
key (immutable) – Key under which data is stored
value – Value to store at key
- Returns
None
-
update
(d)¶ Update this instance with a dictionary of data, d (similar to dict.update())
Keys in d that are present in this object overwrite the previous value. Keys in d that are missing in this object are added.
All data written from d is given the same timepoint (even if timepoint_mode=implicit) - the addition is treated as a single update to the object rather than a series of updates.
- Parameters
d (dict) – Dictionary of data to be added here
- Returns
None
-
get_value_history
(key)¶ Returns a list of tuples of the value at a given key over the entire history of that key
- Parameters
key (immutable) – Any valid dictionary key
- Returns
list containing tuples of:
timepoint (int): Integer timepoint for this value
value (float): The value of key at the corresponding timepoint
- Return type
(list)
-
get_value_at_timepoint
(key, timepoint=-1)¶ Returns the value corresponding to a key at the timepoint that is closest to but not greater than timepoint
Raises a KeyError if key did not exist at timepoint. Raises an IndexError if no timepoint exists that applies
- Parameters
key (immutable) – Any valid dictionary key
timepoint (int) – Integer timepoint to return value for. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)
- Returns
Value corresponding to key at the timepoint closest to but not over timepoint
- Return type
numeric
-
to_dict
(timepoint=-1)¶ Return the state of the data at a given timepoint as a dict
- Parameters
timepoint (int) – Integer timepoint to return data as of. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)
- Returns
Data at timepoint
- Return type
dict
-
clear
() → None. Remove all items from D.¶
-
get
(k[, d]) → D[k] if k in D, else d. d defaults to None.¶
-
increment_timepoint
()¶ Increments the timepoint at which the object is currently writing
- Returns
None
-
items
() → a set-like object providing a view on D's items¶
-
keys
() → a set-like object providing a view on D's keys¶
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given, otherwise KeyError is raised.
-
popitem
() → (k, v), remove and return some (key, value) pair¶ as a 2-tuple; but raise KeyError if D is empty.
-
setdefault
(k[, d]) → D.get(k,d), also set D[k]=d if k not in D¶
-
values
() → an object providing a view on D's values¶
-
class
lrl.data_stores.
EpisodeStatistics
¶ Bases:
object
Container for statistics about a set of independent episodes through an environment, typically following one policy
Statistics are lazily computed and memoized
DOCTODO: Add example usage. show plot_episodes
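Pending that, a brief usage sketch (the episode data shown is purely illustrative; in practice, BaseSolver.score_policy() returns a populated instance that can be passed to lrl.utils.plotting.plot_episodes):
from lrl.data_stores import EpisodeStatistics
stats = EpisodeStatistics()
stats.add(reward=1.0, episode=[(0, 0), (1, 0), (2, 0)], terminal=True)
stats.add(reward=-5.0, episode=[(0, 0), (0, 1)], terminal=True)
print(stats.get_statistic('reward_mean'))   # mean reward over both episodes (-2.0)
df = stats.to_dataframe()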
-
rewards
= None¶ List of the total reward for each episode (raw data)
- Type
list
-
episodes
= None¶ List of all episodes passed to the data object (raw data)
- Type
list
-
steps
= None¶ List of the total steps taken for each episode (raw data)
- Type
list
-
terminals
= None¶ List of whether each input episode was terminal (raw data)
- Type
list
-
add
(reward, episode, terminal)¶ Add an episode to the data store
- Parameters
reward (float) – Total reward from the episode
episode (list) – List of states encountered in the episode, including the starting and final state
terminal (bool) – Boolean indicating if episode was terminal (did environment say episode has ended)
- Returns
None
-
get_statistic
(statistic='reward_mean', index=-1)¶ Return a lazily computed and memoized statistic about the rewards from episodes 0 to index
If the statistic has not previously been computed, it will be computed and returned. See .get_statistics() for the list of statistics available
- Side Effects:
self.statistics[index] will be computed using self.compute() if it has not been already
- Parameters
statistic (str) – See .compute() for available statistics
index (int) – Episode index for requested statistic
Notes
Statistics are computed for all episodes up to and including the requested index. For example if episodes have rewards of [1, 3, 5, 10], get_statistic(‘reward_mean’, index=2) returns 3 (mean of [1, 3, 5]).
DOCTODO: Example usage (show getting some statistics)
- Returns
Value of the statistic requested
- Return type
int, float
-
get_statistics
(index=-1)¶ Return a lazily computed and memoized dictionary of all statistics about episodes 0 to index
If the statistics have not previously been computed, they will be computed here.
- Side Effects:
self.statistics[index] will be computed using self.compute() if it has not been already
- Parameters
index (int) – Episode index for requested statistic
- Returns
Details and statistics about this iteration, with keys:
Details about this iteration:
episode_index (int): Index of episode
terminal (bool): Boolean of whether this episode was terminal
reward (float): This episode’s reward (included to give easy access to per-iteration data)
steps (int): This episode’s steps (included to give easy access to per-iteration data)
Statistics computed for all episodes up to and including this episode:
reward_mean (float):
reward_median (float):
reward_std (float):
reward_max (float):
reward_min (float):
steps_mean (float):
steps_median (float):
steps_std (float):
steps_max (float):
steps_min (float):
terminal_fraction (float):
- Return type
dict
-
compute
(index=-1, force=False)¶ Compute and store statistics about rewards and steps for episodes up to and including the indexth episode
- Side Effects:
self.statistics[index] will be updated
- Parameters
index (int or 'all') – If integer, the index of the episode for which statistics are computed. Eg: If index==3, compute the statistics (see get_statistics() for list) for the series of episodes from 0 up to and not including 3 (typical python indexing rules) If ‘all’, compute statistics for all indices, skipping any that have been previously computed unless force == True
force (bool) –
If True, always recompute statistics even if they already exist.
If False, only compute if no previous statistics exist.
- Returns
None
-
to_dataframe
(include_episodes=False)¶ Return a Pandas DataFrame of the episode statistics
See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns
- Parameters
include_episodes (bool) – If True, add column including the entire episode for each iteration
- Returns
Pandas DataFrame
-
to_csv
(filename, **kwargs)¶ Write statistics to csv via the Pandas DataFrame
See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns
- Parameters
filename (str) – Filename or full path to output data to
kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()
- Returns
None
-
Miscellaneous Utilities¶
-
class
lrl.utils.misc.
Timer
¶ Bases:
object
A Simple Timer class for timing code
-
start
= None¶ timeit.default_timer object initialized at instantiation
-
elapsed
()¶ Return the time elapsed since this object was instantiated, in seconds
- Returns
Time elapsed in seconds
- Return type
float
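Typical use, as a sketch:
from lrl.utils.misc import Timer
timer = Timer()          # timing starts at instantiation
# ... code to be timed ...
print(timer.elapsed())   # elapsed time in seconds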
-
lrl.utils.misc.
print_dict_by_row
(d, fmt='{key:20s}: {val:d}')¶ Print a dictionary with a little extra structure, printing a different key/value to each line.
- Parameters
d (dict) – Dictionary to be printed
fmt (str) – Format string to be used for printing. Must contain key and val formatting references
- Returns
None
-
lrl.utils.misc.
count_dict_differences
(d1, d2, keys=None, raise_on_missing_key=True, print_differences=False)¶ Return the number of differences between two dictionaries. Useful to compare two policies stored as dictionaries.
Does not properly handle floats that are approximately equal. Mainly useful for ints and objects with __eq__
Optionally raise an error on missing keys (otherwise missing keys are counted as differences)
- Parameters
d1 (dict) – Dictionary to compare
d2 (dict) – Dictionary to compare
keys (list) – Optional list of keys to consider for differences. If None, all keys will be considered
raise_on_missing_key (bool) – If true, raise KeyError on any keys not shared by both dictionaries
print_differences (bool) – If true, print all differences to screen
- Returns
Number of differences between the two dictionaries
- Return type
int
-
lrl.utils.misc.
dict_differences
(d1, d2)¶ Return the maximum and mean of the absolute difference between all elements of two dictionaries of numbers
- Parameters
d1 (dict) – Dictionary to compare
d2 (dict) – Dictionary to compare
- Returns
tuple containing:
float: Maximum elementwise difference
float: Sum of elementwise differences
- Return type
tuple
-
lrl.utils.misc.
rc_to_xy
(row, col, rows)¶ Convert from (row, col) coordinates (eg: numpy array) to (x, y) coordinates (bottom left = 0,0)
(x, y) convention
(0,0) in bottom left
x +ve to the right
y +ve up
(row,col) convention:
(0,0) in top left
row +ve down
col +ve to the right
- Parameters
row (int) – row coordinate to be converted
col (int) – col coordinate to be converted
rows (int) – Total number of rows
- Returns
(x, y)
- Return type
tuple
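For example, in a 5-row grid the top-left array element maps to the top of the (x, y) frame:
from lrl.utils.misc import rc_to_xy
x, y = rc_to_xy(row=0, col=2, rows=5)   # top row, third column
print((x, y))                           # expected (2, 4): x follows the column, y counts up from the bottom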
-
lrl.utils.misc.
params_to_name
(params, n_chars=4, sep='_', first_fields=None, key_remap=None)¶ Convert a mappable of parameters into a string for easy test naming
Warning
Currently includes hard-coded formatting that interprets keys named ‘alpha’ or ‘epsilon’
- Parameters
params (dict) – Dictionary to convert to a string
n_chars (int) – Number of characters per key to add to string. Eg: if key=’abcdefg’ and n_chars=4, output will be ‘abcd’
sep (str) – Separator character between fields (uses one of these between key and value, and two between different key-value pairs)
first_fields (list) – Optional list of keys to write ahead of other keys (otherwise, output order is sorted)
key_remap (list) – List of dictionaries of {key_name: new_key_name} for rewriting keys into more readable strings
- Returns
String built from the parameters, suitable for use as a test/case name
- Return type
str