In [None]:
import gym
import numpy as np

In [None]:
env = gym.make("FrozenLake-v0")

We will be using OpenAI Gym. In Gym, every environment has an action space and an observation (state) space, accessible via `env.action_space` and `env.observation_space`, respectively.

The underlying source code for the environment is available at: https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py

In [None]:
action_size = env.action_space.n
state_size = env.observation_space.n
print(f"Action size: {action_size}")
print(f"State size: {state_size}")

Action size: 4
State size: 16
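
The 16 states are the cells of a 4x4 grid. As a small sketch, assuming the default 4x4 map with states numbered row by row (consistent with the transitions printed later, where moving down from state 0 reaches state 4), a state index can be converted to grid coordinates and back. The helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of Gym.

```python
GRID_WIDTH = 4  # assumption: the default 4x4 FrozenLake map


def state_to_cell(state, width=GRID_WIDTH):
    """Convert a flat state index into (row, col), assuming row-major numbering."""
    return divmod(state, width)


def cell_to_state(row, col, width=GRID_WIDTH):
    """Convert (row, col) back into a flat state index."""
    return row * width + col


print(state_to_cell(0))     # top-left corner
print(state_to_cell(15))    # bottom-right corner
print(cell_to_state(1, 0))  # the cell directly below the start
```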


In [None]:
# the action space is represented by a Discrete object
print(env.action_space)

Discrete(4)


In [None]:
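def get_random_trajectory(render=False):
    """Roll out one episode with uniformly random actions; return (rewards, length)."""
    ob = env.reset()
    traj_length = 0
    rewards = []
    while True:
        # sample an action uniformly at random from the action space
        action = env.action_space.sample()
        ob, reward, done, _ = env.step(action)
        traj_length += 1
        rewards.append(reward)
        if render:
            # render before checking `done` so the final state is shown too
            env.render()
        if done:
            break
    return rewards, traj_length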

In [None]:
rewards, traj_length = get_random_trajectory(False)
print(np.sum(rewards))

0.0
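
The sum above is the undiscounted return of the trajectory. The functions later in this notebook take a discount factor `gamma`, so it may help to see how a discounted return could be computed from the same list of rewards. This is only a sketch: `discounted_return` and `example_rewards` are hypothetical names, and the reward list is made up for illustration rather than taken from an actual rollout.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * rewards[t] by accumulating from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Made-up example: the goal reward of 1.0 arrives on the third step.
example_rewards = [0.0, 0.0, 1.0]
print(discounted_return(example_rewards, gamma=1.0))  # same as the plain sum
print(discounted_return(example_rewards, gamma=0.9))  # reward is discounted twice
```

With `gamma=1.0` this reduces to `np.sum(rewards)`; smaller values of `gamma` make rewards that arrive later count for less.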


To implement value iteration and policy iteration, we need the underlying transition distributions. These are generally not exposed by Gym environments, but FrozenLake makes them accessible.
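
To see why the transition distributions are needed, it may help to write out the value iteration update in standard notation. The symbols here are assumptions, not notation from this notebook: $P(s' \mid s, a)$ is the transition probability, $R(s, a, s')$ the reward, $\gamma$ the discount factor, and $V_k$ the value estimate after $k$ sweeps.

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]
$$

The sum over $s'$ is an expectation under the transition model, so it cannot be computed exactly from sampled trajectories alone.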

`env.env.P` is a dictionary containing the underlying transition and reward dynamics for the environment.

In [None]:
print(env.env.P.keys())
transition_dict = env.env.P

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])


We can view the transition distribution for the initial state. Each key corresponds to an action, and each value is a list of the possible transitions for that action. Every transition is a tuple `(p, ns, r, d)` giving the probability of the transition, the next state, the reward, and whether the episode terminates.

Actions are 0-indexed and correspond to ["Left", "Down", "Right", "Up"].

In [None]:
init_state = transition_dict[0]
actions = ["Left", "Down", "Right", "Up"]
for k in init_state:
    print(f"action {k}: transitions (p, ns, r, d):  {init_state[k]}")

action 0: transitions (p, ns, r, d):  [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False)]
action 1: transitions (p, ns, r, d):  [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False)]
action 2: transitions (p, ns, r, d):  [(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]
action 3: transitions (p, ns, r, d):  [(0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False)]
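
Several entries in each list lead to the same next state, so the list is not yet a distribution over next states. A minimal sketch of merging the duplicates follows; the transitions are copied from the output above so that the block runs without Gym, and `state0_transitions` and `next_state_distribution` are hypothetical names.

```python
from collections import defaultdict

THIRD = 1 / 3

# Transitions for state 0 as (p, ns, r, d), copied from the printed output.
state0_transitions = {
    0: [(THIRD, 0, 0.0, False), (THIRD, 0, 0.0, False), (THIRD, 4, 0.0, False)],
    1: [(THIRD, 0, 0.0, False), (THIRD, 4, 0.0, False), (THIRD, 1, 0.0, False)],
    2: [(THIRD, 4, 0.0, False), (THIRD, 1, 0.0, False), (THIRD, 0, 0.0, False)],
    3: [(THIRD, 1, 0.0, False), (THIRD, 0, 0.0, False), (THIRD, 0, 0.0, False)],
}


def next_state_distribution(transitions):
    """Sum the probabilities of all transitions that share a next state."""
    dist = defaultdict(float)
    for prob, next_state, _reward, _done in transitions:
        dist[next_state] += prob
    return dict(dist)


for action, transitions in state0_transitions.items():
    print(f"action {action}: {next_state_distribution(transitions)}")
```

For action 0 ("Left"), for example, two of the three outcomes leave the agent in state 0, so it stays put with probability 2/3 and moves to state 4 with probability 1/3.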


In [None]:
# helper function to print policy
def print_policy(policy):
    reshaped_policy = policy.reshape(4,4)
    for i in range(4):
        x = " "
        for j in range(4):
            x += actions[int(reshaped_policy[i][j])]
            if j < 3:
                x += " | "
        print(x)


In [None]:
policy = np.random.randint(4, size=16)
print("Random policy: ")
print_policy(policy)


Random policy: 
 Down | Down | Up | Down
 Up | Right | Down | Down
 Left | Down | Left | Down
 Left | Left | Left | Down


We have provided some possible functions and their signatures, though you are certainly free to modify them as you see fit.

In [1]:
def value_iteration(values, gamma, iterations=100, termination=1e-4):
    for _ in range(iterations):
        max_update = 0
        # values may be updated in place (asynchronously) while sweeping over the states
        for i in range(16):
            # ********** TODO ***********
            pass
            # ********** TODO ***********
        # terminate if values don't change much
        if max_update < termination:
            break
    return policy, values

# estimate values
def policy_evaluation(policy, init_values, gamma, termination=1e-4):
    # ********** TODO ***********
    return values

# update actions
def policy_improvement(values, gamma):
    # ********** TODO ***********
    return policy
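
As a point of comparison for the skeleton above, here is one possible sketch of value iteration on a tiny hand-written MDP that uses the same `(p, ns, r, d)` transition format as `env.env.P`. It is not the intended solution to the TODOs. The names `bellman_backup`, `value_iteration_sketch`, and `toy_P` are hypothetical, the toy MDP is made up for illustration, and treating `done` transitions as having no future value is an assumption (a common convention).

```python
def bellman_backup(P, values, state, action, gamma):
    """Expected one-step return of taking `action` in `state` under `values`."""
    total = 0.0
    for prob, next_state, reward, done in P[state][action]:
        # Assumption: a terminating transition contributes no future value.
        future = 0.0 if done else values[next_state]
        total += prob * (reward + gamma * future)
    return total


def value_iteration_sketch(P, gamma, iterations=100, termination=1e-4):
    n_states = len(P)
    values = [0.0] * n_states
    for _ in range(iterations):
        max_update = 0.0
        for s in range(n_states):
            best = max(bellman_backup(P, values, s, a, gamma) for a in P[s])
            max_update = max(max_update, abs(best - values[s]))
            values[s] = best  # in-place (asynchronous) update
        # stop once no value changed by more than the tolerance
        if max_update < termination:
            break
    # greedy policy with respect to the final values (ties go to the lowest action)
    policy = [
        max(P[s], key=lambda a: bellman_backup(P, values, s, a, gamma))
        for s in range(n_states)
    ]
    return policy, values


# Made-up 3-state chain: 0 -> 1 -> 2, where entering state 2 pays 1.0 and ends the episode.
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

toy_policy, toy_values = value_iteration_sketch(toy_P, gamma=0.9)
print(toy_policy)
print(toy_values)
```

Because the toy dictionary has the same layout as `transition_dict`, the same functions could in principle be called on `transition_dict`, with the resulting policy wrapped in `np.array` before it is passed to `print_policy`.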







    