Example Case using Racetrack

Below is an example of how to initialize the Racetrack environment, solve it with multiple solvers, and compare the results.

Boilerplate

If you’re playing with things under the hood as you run these, autoreload is always useful…

[1]:
%load_ext autoreload

%autoreload 2

If necessary, add the directory containing lrl to the path (a workaround for when lrl is not installed as a package)

[2]:
import sys

# Path to directory containing lrl
sys.path.append('../')
[3]:
from lrl import environments, solvers
from lrl.utils import plotting

import matplotlib.pyplot as plt

Logging is used throughout lrl for basic info and debugging.

[4]:
import logging
logging.basicConfig(format='%(asctime)s - %(name)s - %(funcName)s - %(levelname)s - %(message)s',
                    level=logging.INFO, datefmt='%H:%M:%S')
logger = logging.getLogger(__name__)

Initialize an Environment

Initialize the 20x10 racetrack that includes some oily (stochastic) surfaces.

Note: Make sure that your velocity limits suit your track. A track must have enough grass padding around the entire course to prevent a car from exiting the map entirely, so if max(abs(vel)) == 3, you need 3 grass tiles around the outside perimeter of your map. For this track we have a 2-tile perimeter, so velocity magnitudes must not exceed 2.
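
As a quick sanity check of that rule (a sketch only, not part of lrl's API; the limits below mirror the working call in the next cell), the required perimeter width is simply the largest absolute velocity component allowed:

x_vel_limits = (-2, 2)
y_vel_limits = (-2, 2)

# The car can move at most this many tiles per step along either axis, so the
# grass perimeter must be at least this many tiles wide
required_padding = max(abs(v) for v in x_vel_limits + y_vel_limits)
print(f'Grass perimeter must be at least {required_padding} tiles wide')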

[5]:
# This will raise an error due to x_vel max limit
try:
    rt = environments.get_racetrack(track='20x10_U',
                                     x_vel_limits=(-2, 20),  # Note high x_vel upper limit
                                     y_vel_limits=(-2, 2),
                                     x_accel_limits=(-2, 2),
                                     y_accel_limits=(-2, 2),
                                     max_total_accel=2,
                                     )
except IndexError as e:
    print("Caught the following error while building a track that shouldn't work:")
    print(e)
    print("")

# This will work
try:
    rt = environments.get_racetrack(track='20x10_U',
                                    x_vel_limits=(-2, 2),
                                    y_vel_limits=(-2, 2),
                                    x_accel_limits=(-2, 2),
                                    y_accel_limits=(-2, 2),
                                    max_total_accel=2,
                                    )
    print("But second track built perfectly!")
except Exception:
    print("Something went wrong, we shouldn't be here")
Caught the following error while building a track that shouldn't work:
Caught IndexError while building Racetrack.  Likely cause is a max velocity that is greater than the wall padding around the track (leading to a car that can exit the track entirely)

But second track built perfectly!

Take a look at the track using plot_env

[6]:
plotting.plot_env(env=rt)
[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f32321ae48>
_images/example_solution_racetrack_13_1.png

There are also additional tracks available - see the racetrack code base for more

[7]:
print(f'Available tracks: {list(environments.racetrack.TRACKS.keys())}')
Available tracks: ['3x4_basic', '5x4_basic', '10x10', '10x10_basic', '10x10_all_oil', '15x15_basic', '20x20_basic', '20x20_all_oil', '30x30_basic', '20x10_U_all_oil', '20x10_U', '10x10_oil', '20x15_risky']

Tracks are simply lists of strings using a specific set of characters - see the racetrack code for more detail on how to make your own (a rough legend is sketched after the printout below)

[8]:
for line in environments.racetrack.TRACKS['20x10_U']:
    print(line)
GGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGG
GGG     OOOO      GG
GGG     GGGG      GG
GGOOOOGGGGGGGGOOOOGG
GGOOOGGGGGGGGGGOOOGG
GG    GGGGGGGG    GG
GG      SGGGGG    GG
GGGGGGGGGGGGGGFFFFGG
GGGGGGGGGGGGGGFFFFGG
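
For reference, here is a rough legend for the characters in the printout above (inferred from this example; check the racetrack source for the authoritative meanings):

# Hypothetical legend, inferred from this example track
TRACK_CHARACTERS = {
    'G': 'grass (off-track)',
    ' ': 'open track',
    'O': 'oily (stochastic) track',
    'S': 'starting location',
    'F': 'finish location',
}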

We can draw them using character art! For example, here is a custom track with more oil and a different shape than above…

[9]:
custom_track = \
"""GGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGG
GGGOOOOOOOOOOOOOOOGG
GGG     GGGG      GG
GG    GGGGGGGG    GG
GGOOOOOOSGGGGGOOOOGG
GGGGGGGGGGGGGGFFFFGG
GGGGGGGGGGGGGGFFFFGG"""
custom_track = custom_track.split('\n')
[10]:
rt_custom = environments.get_racetrack(track=custom_track,
                                       x_vel_limits=(-2, 2),
                                       y_vel_limits=(-2, 2),
                                       x_accel_limits=(-2, 2),
                                       y_accel_limits=(-2, 2),
                                       max_total_accel=2,
                                       )
[11]:
plotting.plot_env(env=rt_custom)
[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f325f99ac8>
_images/example_solution_racetrack_21_1.png

Solve with Value Iteration and Interrogate Solution

[12]:
rt_vi = solvers.ValueIteration(env=rt)
rt_vi.iterate_to_convergence()
16:33:21 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Solver iterating to convergence (Max delta in value function < 0.001 or iters>500)
16:33:23 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Solver converged to solution in 18 iterations

And we can then score our solution by running it multiple times through the environment

[13]:
scoring_data = rt_vi.score_policy(iters=500)

score_policy returns an EpisodeStatistics object that contains details from each episode taken during the scoring. The easiest way to interact with it is to grab the data as a dataframe

[14]:
print(f'type(scoring_data) = {type(scoring_data)}')
scoring_data_df = scoring_data.to_dataframe(include_episodes=True)
scoring_data_df.head(3)
type(scoring_data) = <class 'lrl.data_stores.data_stores.EpisodeStatistics'>
[14]:
episode_index reward steps terminal reward_mean reward_median reward_std reward_min reward_max steps_mean steps_median steps_std steps_min steps_max terminal_fraction episodes
0 0 91.0 11 True 91.0 91.0 0.0 91.0 91.0 11.0 11.0 0.0 11 11 1.0 [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (...
1 1 91.0 11 True 91.0 91.0 0.0 91.0 91.0 11.0 11.0 0.0 11 11 1.0 [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (...
2 2 91.0 11 True 91.0 91.0 0.0 91.0 91.0 11.0 11.0 0.0 11 11 1.0 [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (...
[15]:
scoring_data_df.tail(3)
[15]:
episode_index reward steps terminal reward_mean reward_median reward_std reward_min reward_max steps_mean steps_median steps_std steps_min steps_max terminal_fraction episodes
497 497 91.0 11 True 90.473896 91.0 0.880581 89.0 91.0 11.526104 11.0 0.880581 11 13 1.0 [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (...
498 498 91.0 11 True 90.474950 91.0 0.880013 89.0 91.0 11.525050 11.0 0.880013 11 13 1.0 [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (...
499 499 91.0 11 True 90.476000 91.0 0.879445 89.0 91.0 11.524000 11.0 0.879445 11 13 1.0 [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (...

The reward, steps, and terminal columns give data on that specific episode, whereas the reward_mean, reward_median, etc. columns give aggregate statistics over all episodes up to and including that one. For example:

[16]:
print(f'The reward obtained in the 499th episode was {scoring_data_df.loc[499, "reward"]}')
print(f'The mean reward obtained in the 0-499th episodes (inclusive) was {scoring_data_df.loc[499, "reward_mean"]}')
The reward obtained in the 499th episode was 91.0
The mean reward obtained in the 0-499th episodes (inclusive) was 90.476
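
As a quick check of that interpretation (a sketch using pandas on the dataframe above), the running mean at episode 499 should match a direct mean over the per-episode reward column:

# reward_mean at row 499 aggregates episodes 0-499, so it should equal the
# plain mean of the per-episode reward column
print(f'Direct mean of episode rewards : {scoring_data_df["reward"].mean()}')
print(f'reward_mean at episode 499     : {scoring_data_df.loc[499, "reward_mean"]}')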

And we can access the actual episode path for each episode

[17]:
print(f'Episode 0 (directly)          : {scoring_data.episodes[0]}')
print(f'Episode 0 (from the dataframe): {scoring_data_df.loc[0, "episodes"]}')
Episode 0 (directly)          : [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (5, 5, 0, 2), (7, 7, 2, 2), (9, 7, 2, 0), (11, 7, 2, 0), (13, 6, 2, -1), (15, 4, 2, -2), (15, 2, 0, -2), (14, 0, -1, -2)]
Episode 0 (from the dataframe): [(8, 2, 0, 0), (6, 2, -2, 0), (5, 3, -1, 1), (5, 5, 0, 2), (7, 7, 2, 2), (9, 7, 2, 0), (11, 7, 2, 0), (13, 6, 2, -1), (15, 4, 2, -2), (15, 2, 0, -2), (14, 0, -1, -2)]

Plotting Results

And plot 100 randomly chosen episodes on the map, returned as a matplotlib axes

[18]:
ax_episodes = plotting.plot_episodes(episodes=scoring_data.episodes, env=rt, max_episodes=100, )
_images/example_solution_racetrack_35_0.png

score_policy also lets us explore hypothetical scenarios, such as what would happen if we started from a different location. Let's try that by starting in the top left ((x, y) location (3, 7), where x=0 is at the left and y=0 is at the bottom) with a velocity of (-2, -2), and plot the episodes onto our existing axes in red.

[19]:
scoring_data_alternate = rt_vi.score_policy(iters=500, initial_state=(3, 7, -2, -2))
ax_episodes_with_alternate = plotting.plot_episodes(episodes=scoring_data_alternate.episodes, env=rt,
                                                    add_env_to_plot=False, color='r', ax=ax_episodes
#                                                     savefig='my_figure_file',  # If you wanted the figure to save directly
                                                                                 # to file, use savefig (used throughout
                                                                                 # lrl's plotting scripts)
                                                    )

# Must use get_figure() because we're reusing the figure from above and jupyter won't automatically reshow it
ax_episodes_with_alternate.get_figure()
[19]:
_images/example_solution_racetrack_37_0.png

Here we can see the optimal policy takes a first action of (+2, 0), resulting in a first velocity of (0, -2) to avoid hitting the grass on step 1, and then redirects back up and around the track (although it sometimes slips on (3, 5) and drives down to (3, 3) before recovering)
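
We can check how often that slip occurs by counting the alternate-start episodes that ever visit (x, y) = (3, 3) (a sketch; it assumes each episode is a list of (x, y, v_x, v_y) tuples, as shown earlier):

# Count how many of the alternate-start episodes pass through position (3, 3)
n_slipped = sum(any((x, y) == (3, 3) for (x, y, _, _) in episode)
                for episode in scoring_data_alternate.episodes)
print(f'{n_slipped} of {len(scoring_data_alternate.episodes)} episodes passed through (3, 3)')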

We can also look at the Value function and best policy for all states

NOTE: Because the state is (x, y, v_x, v_y), it is hard to capture all results on our (x, y) maps. plot_solver_results in this case plots a map for each (v_x, v_y) combination, with the axis title denoting which combination is shown.

[20]:
# Sorry, this will plot 24 plots normally.
# To keep concise, we turn matplotlib inline plotting off then back on and selectively plot a few examples
%matplotlib agg
ax_results = plotting.plot_solver_results(env=rt, solver=rt_vi)
%matplotlib inline

C:\Users\Scribs\Anaconda3\envs\lrl\lib\site-packages\matplotlib\pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Plots are indexed by the additional state variables (v_x, v_y), so we can see the policy for v_x=0, v_y=0 like this:

[21]:
ax_results[(0, 0)].get_figure()
[21]:
_images/example_solution_racetrack_42_0.png

Here we can see the estimated value of the starting (light green) location is 30.59 (including discounting and costs for steps).

We can also look at the policy for (-2, -2) (the velocity for our alternate start used above in red):

[22]:
ax_results[(-2, -2)].get_figure()
[22]:
_images/example_solution_racetrack_45_0.png

Here we can see that in the top-left tile (3, 7) the optimal policy is (+2, 0), just as we saw in our red path plot above.

We can also see that tile (2, 2) (bottom left) has a value of -100 because, with an initial velocity of (-2, -2) and an acceleration limit of 2, it is impossible to avoid crashing when starting from tile (2, 2).

Solving with Policy Iteration and Comparing to Value Iteration

We can also use other solvers

[23]:
rt_pi = solvers.PolicyIteration(env=rt)
rt_pi.iterate_to_convergence()
16:33:37 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Solver iterating to convergence (1 iteration without change in policy or iters>500)
16:33:39 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Solver converged to solution in 6 iterations

We can look at how PI and VI converged relative to each other, comparing the maximum change in value function for each iteration

[24]:
# (these are simple convenience functions for plotting, basically just recipes.  See the plotting API)
# We can pass the solver..
ax = plotting.plot_solver_convergence(rt_vi, label='vi')

# Or going a little deeper into the API, with style being passed to matplotlib's plot function...
ax = plotting.plot_solver_convergence_from_df(rt_pi.iteration_data.to_dataframe(), y='delta_max', x='iteration', ax=ax, label='pi', ls='', marker='o')

ax.legend()
[24]:
<matplotlib.legend.Legend at 0x1f33ad329e8>
_images/example_solution_racetrack_51_1.png

And looking at policy changes per iteration

[25]:
# (these are simple convenience functions for plotting, basically just recipes.  See the plotting API)
# We can pass the solver..
ax = plotting.plot_solver_convergence(rt_vi, y='policy_changes', label='vi')

# Or going a little deeper into the API...
ax = plotting.plot_solver_convergence_from_df(rt_pi.iteration_data.to_dataframe(), y='policy_changes', x='iteration', ax=ax, label='pi', ls='', marker='o')

ax.legend()
[25]:
<matplotlib.legend.Legend at 0x1f33be75780>
_images/example_solution_racetrack_53_1.png

So we can see PI accomplishes more per iteration. But is it faster? Let's look at the time per iteration

[26]:
# (these are simple convenience functions for plotting, basically just recipes.  See the plotting API)
# We can pass the solver..
ax = plotting.plot_solver_convergence(rt_vi, y='time', label='vi')

# Or going a little deeper into the API...
ax = plotting.plot_solver_convergence_from_df(rt_pi.iteration_data.to_dataframe(), y='time', x='iteration', ax=ax, label='pi', ls='', marker='o')

ax.legend()

print(f'Total solution time for Value Iteration (excludes any scoring time):  {rt_vi.iteration_data.to_dataframe().loc[:, "time"].sum():.2f}s')
print(f'Total solution time for Policy Iteration (excludes any scoring time): {rt_pi.iteration_data.to_dataframe().loc[:, "time"].sum():.2f}s')
Total solution time for Value Iteration (excludes any scoring time):  1.85s
Total solution time for Policy Iteration (excludes any scoring time): 2.26s
_images/example_solution_racetrack_55_1.png

Solve with Q-Learning

We can also use QLearning, although it needs a few parameters

[27]:
# Let's be explicit with our QLearning settings for alpha and epsilon
alpha = 0.1  # Constant alpha during learning

# Decay function for epsilon (see QLearning() and decay_functions() in documentation for syntax)
# Decay epsilon linearly from 0.2 at timestep (iteration) 0 to 0.05 at timestep 1500,
# keeping constant at 0.05 for ts>1500
epsilon = {
    'type': 'linear',
    'initial_value': 0.2,
    'initial_timestep': 0,
    'final_value': 0.05,
    'final_timestep': 1500
}

# Above PI/VI used the default gamma, but we will specify one here
gamma = 0.9

# Convergence is kinda tough to interpret automatically for Q-Learning.  One good way to monitor convergence is to
# evaluate how good the greedy policy at a given point in the solution is and decide if it is still improving.
# We can enable this with score_while_training (available for Value and Policy Iteration as well)
# NOTE: During scoring runs, the solver is acting greedily and NOT learning from the environment.  These are separate
#       runs solely used to estimate solution progress
# NOTE: Scoring every 50 iterations is probably a bit much, but it is used here to show a nice plot below.  The
#       default of 500/500 is probably better general guidance
score_while_training = {
    'n_trains_per_eval': 50,  # Number of training episodes we run per attempt to score the greedy policy
                              # (eg: here we do a scoring run after every 50 training episodes, where training episodes
                              # are the usual epsilon-greedy exploration episodes)
    'n_evals': 250,  # Number of times we run through the env with the greedy policy whenever we score
}
# score_while_training = True  # This uses the default settings (500 evals after every 500 training episodes, as noted above)

rt_ql = solvers.QLearning(env=rt, alpha=alpha, epsilon=epsilon, gamma=gamma,
                          max_iters=5000, score_while_training=score_while_training)
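
To make the epsilon schedule above concrete, here is a rough sketch of what the linear decay means (illustrative only, not lrl's actual decay_functions implementation): epsilon is interpolated between the initial and final values, then held constant.

def linear_epsilon(timestep, initial_value=0.2, initial_timestep=0,
                   final_value=0.05, final_timestep=1500):
    """Illustrative linear decay matching the schedule defined above."""
    if timestep >= final_timestep:
        return final_value
    fraction = (timestep - initial_timestep) / (final_timestep - initial_timestep)
    return initial_value + fraction * (final_value - initial_value)

print(linear_epsilon(0), linear_epsilon(750), linear_epsilon(3000))  # 0.2 0.125 0.05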

(note how long Q-Learning takes for this environment versus the planning algorithms)

[28]:
rt_ql.iterate_to_convergence()
16:33:41 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Solver iterating to convergence (20 episodes with max delta in Q function < 0.1 or iters>5000)
16:33:41 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 0 of Q-Learning
16:33:41 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 50
16:33:42 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.0, r_max = -103.0
16:33:42 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 100
16:33:43 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.0, r_max = -103.0
16:33:43 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 150
16:33:43 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.0, r_max = -103.0
16:33:44 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 200
16:33:44 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -102.0, r_max = -102.0
16:33:44 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 250
16:33:45 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.424, r_max = -102.0
16:33:45 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 300
16:33:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:33:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 350
16:33:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.0, r_max = -103.0
16:33:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 400
16:33:47 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.224, r_max = -104.0
16:33:47 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 450
16:33:47 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.0, r_max = -103.0
16:33:48 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 500
16:33:48 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.22, r_max = -104.0
16:33:48 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 500 of Q-Learning
16:33:48 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 550
16:33:49 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.264, r_max = -104.0
16:33:49 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 600
16:33:50 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.0, r_max = -104.0
16:33:50 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 650
16:33:50 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.0, r_max = -104.0
16:33:50 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 700
16:33:51 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.228, r_max = -104.0
16:33:51 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 750
16:33:52 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.0, r_max = -106.0
16:33:52 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 800
16:33:52 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.464, r_max = -104.0
16:33:53 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 850
16:33:53 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.0, r_max = -105.0
16:33:53 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 900
16:33:54 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.0, r_max = -105.0
16:33:54 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 950
16:33:54 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.824, r_max = -105.0
16:33:55 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1000
16:33:55 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.54, r_max = -104.0
16:33:55 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 1000 of Q-Learning
16:33:55 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1050
16:33:56 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.0, r_max = -104.0
16:33:56 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1100
16:33:57 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:33:57 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1150
16:33:57 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.796, r_max = -103.0
16:33:58 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1200
16:33:58 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.128, r_max = -106.0
16:33:58 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1250
16:33:59 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.0, r_max = -104.0
16:33:59 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1300
16:34:00 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -107.852, r_max = -107.0
16:34:00 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1350
16:34:00 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.416, r_max = -104.0
16:34:01 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1400
16:34:01 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:34:02 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1450
16:34:02 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -108.08, r_max = -108.0
16:34:02 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1500
16:34:03 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.84, r_max = -105.0
16:34:03 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 1500 of Q-Learning
16:34:03 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1550
16:34:03 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -108.224, r_max = -108.0
16:34:04 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1600
16:34:04 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.0, r_max = -104.0
16:34:04 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1650
16:34:05 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.0, r_max = -106.0
16:34:05 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1700
16:34:06 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:34:06 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1750
16:34:07 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.78, r_max = -104.0
16:34:07 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1800
16:34:07 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -112.76, r_max = -111.0
16:34:08 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1850
16:34:08 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -104.732, r_max = -104.0
16:34:08 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1900
16:34:09 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -117.248, r_max = -117.0
16:34:09 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 1950
16:34:10 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:34:10 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2000
16:34:10 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.0, r_max = -106.0
16:34:10 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 2000 of Q-Learning
16:34:11 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2050
16:34:11 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -107.12, r_max = -100.0
16:34:12 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2100
16:34:12 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -115.0, r_max = -115.0
16:34:12 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2150
16:34:13 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -102.516, r_max = -100.0
16:34:13 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2200
16:34:14 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.72, r_max = -105.0
16:34:14 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2250
16:34:14 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -109.88, r_max = -109.0
16:34:15 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2300
16:34:15 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.496, r_max = -105.0
16:34:16 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2350
16:34:16 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -108.12, r_max = -104.0
16:34:16 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2400
16:34:17 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -107.168, r_max = -100.0
16:34:17 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2450
16:34:17 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -107.34, r_max = -106.0
16:34:18 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2500
16:34:18 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -105.568, r_max = -105.0
16:34:18 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 2500 of Q-Learning
16:34:18 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2550
16:34:19 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.0, r_max = -106.0
16:34:19 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2600
16:34:20 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -108.0, r_max = -108.0
16:34:20 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2650
16:34:21 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -111.952, r_max = -108.0
16:34:21 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2700
16:34:21 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -107.512, r_max = -106.0
16:34:22 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2750
16:34:22 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -120.848, r_max = -115.0
16:34:22 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2800
16:34:23 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -110.376, r_max = -110.0
16:34:23 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2850
16:34:24 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -112.776, r_max = -111.0
16:34:24 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2900
16:34:24 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -103.8, r_max = -100.0
16:34:25 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 2950
16:34:25 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -107.868, r_max = -106.0
16:34:26 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3000
16:34:26 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -101.668, r_max = -100.0
16:34:26 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 3000 of Q-Learning
16:34:27 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3050
16:34:27 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -49.848, r_max = 89.0
16:34:27 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3100
16:34:28 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -101.392, r_max = -100.0
16:34:28 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3150
16:34:29 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -114.236, r_max = 77.0
16:34:29 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3200
16:34:30 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -106.996, r_max = -106.0
16:34:30 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3250
16:34:30 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -124.144, r_max = -100.0
16:34:31 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3300
16:34:31 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -109.284, r_max = -106.0
16:34:32 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3350
16:34:32 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -102.112, r_max = -100.0
16:34:33 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3400
16:34:33 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -114.396, r_max = -105.0
16:34:34 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3450
16:34:34 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:34:35 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3500
16:34:35 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -100.0, r_max = -100.0
16:34:35 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 3500 of Q-Learning
16:34:36 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3550
16:34:36 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 36.04, r_max = 89.0
16:34:37 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3600
16:34:37 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -109.624, r_max = -107.0
16:34:37 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3650
16:34:38 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = -87.048, r_max = 87.0
16:34:38 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3700
16:34:39 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.472, r_max = 91.0
16:34:39 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3750
16:34:40 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.536, r_max = 91.0
16:34:40 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3800
16:34:40 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.488, r_max = 91.0
16:34:41 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3850
16:34:41 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.416, r_max = 91.0
16:34:41 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3900
16:34:42 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.464, r_max = 91.0
16:34:42 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 3950
16:34:42 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.392, r_max = 91.0
16:34:43 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4000
16:34:43 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.432, r_max = 91.0
16:34:43 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 4000 of Q-Learning
16:34:44 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4050
16:34:44 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.52, r_max = 91.0
16:34:44 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4100
16:34:45 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.424, r_max = 91.0
16:34:45 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4150
16:34:45 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.4, r_max = 91.0
16:34:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4200
16:34:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.504, r_max = 91.0
16:34:46 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4250
16:34:47 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.544, r_max = 91.0
16:34:47 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4300
16:34:48 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.544, r_max = 91.0
16:34:48 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4350
16:34:48 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.424, r_max = 91.0
16:34:49 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4400
16:34:49 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.512, r_max = 91.0
16:34:49 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4450
16:34:50 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.552, r_max = 91.0
16:34:50 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4500
16:34:51 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.536, r_max = 91.0
16:34:51 - lrl.solvers.learners - iterate - INFO - Performing iteration (episode) 4500 of Q-Learning
16:34:51 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4550
16:34:51 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.544, r_max = 91.0
16:34:52 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4600
16:34:52 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.48, r_max = 91.0
16:34:52 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4650
16:34:53 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.504, r_max = 91.0
16:34:53 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4700
16:34:54 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.544, r_max = 91.0
16:34:54 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4750
16:34:54 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.472, r_max = 91.0
16:34:55 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4800
16:34:55 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.728, r_max = 91.0
16:34:55 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4850
16:34:56 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.52, r_max = 91.0
16:34:56 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4900
16:34:56 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.584, r_max = 91.0
16:34:57 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 4950
16:34:57 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.496, r_max = 91.0
16:34:57 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy being scored 250 times at iteration 5000
16:34:58 - lrl.solvers.base_solver - iterate_to_convergence - INFO - Current greedy policy achieved: r_mean = 90.408, r_max = 91.0
16:34:58 - lrl.solvers.base_solver - iterate_to_convergence - WARNING - Max iterations (5000) reached - solver did not converge
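
To quantify that difference, you could sum the per-iteration times for the Q-Learning solver the same way we did for VI and PI above (a sketch; it assumes Q-Learning's iteration_data also records a time column, and the total will depend on your machine):

# Assumes iteration_data for Q-Learning includes a 'time' column like VI/PI
rt_ql_time = rt_ql.iteration_data.to_dataframe().loc[:, 'time'].sum()
print(f'Total solution time for Q-Learning (excludes any scoring time): {rt_ql_time:.2f}s')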

As above, we can plot the number of policy changes per iteration. But this plot looks very different and shows one view of why Q-Learning needs many more iterations: each iteration accomplishes far less learning than an iteration of a planning algorithm

[29]:
rt_ql_iter_df = rt_ql.iteration_data.to_dataframe()
rt_ql_iter_df.plot(x='iteration', y='policy_changes', kind='scatter', )
[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f33bf52cc0>
_images/example_solution_racetrack_62_1.png

We can access the intermediate scoring through the scoring_summary (GeneralIterationData) and scoring_episode_statistics (EpisodeStatistics) objects

[30]:
rt_ql_intermediate_scoring_df = rt_ql.scoring_summary.to_dataframe()
rt_ql_intermediate_scoring_df
[30]:
iteration reward_mean reward_median reward_std reward_min reward_max steps_mean steps_median steps_std steps_min steps_max
0 50 -103.000 -103.0 0.000000 -103.0 -103.0 5.000 5.0 0.000000 5 5
1 100 -103.000 -103.0 0.000000 -103.0 -103.0 5.000 5.0 0.000000 5 5
2 150 -103.000 -103.0 0.000000 -103.0 -103.0 5.000 5.0 0.000000 5 5
3 200 -102.000 -102.0 0.000000 -102.0 -102.0 4.000 4.0 0.000000 4 4
4 250 -103.424 -104.0 0.905662 -104.0 -102.0 5.424 6.0 0.905662 4 6
5 300 -100.000 -100.0 0.000000 -100.0 -100.0 101.000 101.0 0.000000 101 101
6 350 -103.000 -103.0 0.000000 -103.0 -103.0 5.000 5.0 0.000000 5 5
7 400 -104.224 -104.0 0.416922 -105.0 -104.0 6.224 6.0 0.416922 6 7
8 450 -103.000 -103.0 0.000000 -103.0 -103.0 5.000 5.0 0.000000 5 5
9 500 -104.220 -104.0 0.414246 -105.0 -104.0 6.220 6.0 0.414246 6 7
10 550 -104.264 -104.0 0.440799 -105.0 -104.0 6.264 6.0 0.440799 6 7
11 600 -104.000 -104.0 0.000000 -104.0 -104.0 6.000 6.0 0.000000 6 6
12 650 -104.000 -104.0 0.000000 -104.0 -104.0 6.000 6.0 0.000000 6 6
13 700 -104.228 -104.0 0.419543 -105.0 -104.0 6.228 6.0 0.419543 6 7
14 750 -106.000 -106.0 0.000000 -106.0 -106.0 8.000 8.0 0.000000 8 8
15 800 -104.464 -104.0 0.844218 -106.0 -104.0 6.464 6.0 0.844218 6 8
16 850 -105.000 -105.0 0.000000 -105.0 -105.0 7.000 7.0 0.000000 7 7
17 900 -105.000 -105.0 0.000000 -105.0 -105.0 7.000 7.0 0.000000 7 7
18 950 -106.824 -107.0 1.302699 -109.0 -105.0 8.824 9.0 1.302699 7 11
19 1000 -104.540 -104.0 1.152562 -107.0 -104.0 6.540 6.0 1.152562 6 9
20 1050 -104.000 -104.0 0.000000 -104.0 -104.0 6.000 6.0 0.000000 6 6
21 1100 -100.000 -100.0 0.000000 -100.0 -100.0 101.000 101.0 0.000000 101 101
22 1150 -103.796 -104.0 0.402969 -104.0 -103.0 5.796 6.0 0.402969 5 6
23 1200 -106.128 -106.0 0.489506 -108.0 -106.0 8.128 8.0 0.489506 8 10
24 1250 -104.000 -104.0 0.000000 -104.0 -104.0 6.000 6.0 0.000000 6 6
25 1300 -107.852 -107.0 1.352810 -110.0 -107.0 9.852 9.0 1.352810 9 12
26 1350 -105.416 -106.0 0.909365 -106.0 -104.0 7.416 8.0 0.909365 6 8
27 1400 -100.000 -100.0 0.000000 -100.0 -100.0 101.000 101.0 0.000000 101 101
28 1450 -108.080 -108.0 0.271293 -109.0 -108.0 10.080 10.0 0.271293 10 11
29 1500 -105.840 -105.0 1.474585 -112.0 -105.0 7.840 7.0 1.474585 7 14
... ... ... ... ... ... ... ... ... ... ... ...
70 3550 36.040 89.0 87.544037 -116.0 89.0 12.360 13.0 2.076150 9 18
71 3600 -109.624 -107.0 4.556602 -131.0 -107.0 11.624 9.0 4.556602 9 33
72 3650 -87.048 -100.0 53.641902 -113.0 87.0 64.860 101.0 43.202042 11 101
73 3700 90.472 91.0 0.881599 89.0 91.0 11.528 11.0 0.881599 11 13
74 3750 90.536 91.0 0.844218 89.0 91.0 11.464 11.0 0.844218 11 13
75 3800 90.488 91.0 0.872844 89.0 91.0 11.512 11.0 0.872844 11 13
76 3850 90.416 91.0 0.909365 89.0 91.0 11.584 11.0 0.909365 11 13
77 3900 90.464 91.0 0.885835 89.0 91.0 11.536 11.0 0.885835 11 13
78 3950 90.392 91.0 0.919965 89.0 91.0 11.608 11.0 0.919965 11 13
79 4000 90.432 91.0 0.901874 89.0 91.0 11.568 11.0 0.901874 11 13
80 4050 90.520 91.0 0.854166 89.0 91.0 11.480 11.0 0.854166 11 13
81 4100 90.424 91.0 0.905662 89.0 91.0 11.576 11.0 0.905662 11 13
82 4150 90.400 91.0 0.916515 89.0 91.0 11.600 11.0 0.916515 11 13
83 4200 90.504 91.0 0.863704 89.0 91.0 11.496 11.0 0.863704 11 13
84 4250 90.544 91.0 0.839085 89.0 91.0 11.456 11.0 0.839085 11 13
85 4300 90.544 91.0 0.839085 89.0 91.0 11.456 11.0 0.839085 11 13
86 4350 90.424 91.0 0.905662 89.0 91.0 11.576 11.0 0.905662 11 13
87 4400 90.512 91.0 0.858985 89.0 91.0 11.488 11.0 0.858985 11 13
88 4450 90.552 91.0 0.833844 89.0 91.0 11.448 11.0 0.833844 11 13
89 4500 90.536 91.0 0.844218 89.0 91.0 11.464 11.0 0.844218 11 13
90 4550 90.544 91.0 0.839085 89.0 91.0 11.456 11.0 0.839085 11 13
91 4600 90.480 91.0 0.877268 89.0 91.0 11.520 11.0 0.877268 11 13
92 4650 90.504 91.0 0.863704 89.0 91.0 11.496 11.0 0.863704 11 13
93 4700 90.544 91.0 0.839085 89.0 91.0 11.456 11.0 0.839085 11 13
94 4750 90.472 91.0 0.881599 89.0 91.0 11.528 11.0 0.881599 11 13
95 4800 90.728 91.0 0.685577 89.0 91.0 11.272 11.0 0.685577 11 13
96 4850 90.520 91.0 0.854166 89.0 91.0 11.480 11.0 0.854166 11 13
97 4900 90.584 91.0 0.811754 89.0 91.0 11.416 11.0 0.811754 11 13
98 4950 90.496 91.0 0.868323 89.0 91.0 11.504 11.0 0.868323 11 13
99 5000 90.408 91.0 0.912982 89.0 91.0 11.592 11.0 0.912982 11 13

100 rows × 11 columns

[31]:
plt.plot(rt_ql_intermediate_scoring_df.loc[:, 'iteration'], rt_ql_intermediate_scoring_df.loc[:, 'reward_mean'], '-o')
[31]:
[<matplotlib.lines.Line2D at 0x1f33df6ca90>]
_images/example_solution_racetrack_65_1.png

In this case we see that somewhere between the 3500th and 4000th iterations the solver finds a solution and starts building its policy around it. This won't always be the case; learning may be more incremental
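
One way to pin that down (a sketch using the intermediate scoring dataframe above) is to find the first scoring run whose greedy policy achieved a positive mean reward - per the table above, that is the run at iteration 3550:

# First intermediate scoring run where the greedy policy earned a positive mean reward
first_good = rt_ql_intermediate_scoring_df[rt_ql_intermediate_scoring_df['reward_mean'] > 0].iloc[0]
print(f"First scoring run with positive mean reward: iteration {int(first_good['iteration'])}")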

And if we wanted to access the actual episodes that went into one of these datapoints, they’re available in a dictionary of EpisodeStatistics objects here (keyed by iteration number):

[33]:
i = 3750
print(f'EpisodeStatistics for the scoring at iter == {i}:\n')
rt_ql.scoring_episode_statistics[i].to_dataframe().head()
EpisodeStatistics for the scoring at iter == 3750:

[33]:
episode_index reward steps terminal reward_mean reward_median reward_std reward_min reward_max steps_mean steps_median steps_std steps_min steps_max terminal_fraction
0 0 91.0 11 True 91.0 91.0 0.000000 91.0 91.0 11.0 11.0 0.000000 11 11 1.0
1 1 91.0 11 True 91.0 91.0 0.000000 91.0 91.0 11.0 11.0 0.000000 11 11 1.0
2 2 91.0 11 True 91.0 91.0 0.000000 91.0 91.0 11.0 11.0 0.000000 11 11 1.0
3 3 89.0 13 True 90.5 91.0 0.866025 89.0 91.0 11.5 11.0 0.866025 11 13 1.0
4 4 89.0 13 True 90.2 91.0 0.979796 89.0 91.0 11.8 11.0 0.979796 11 13 1.0