# Markov Decision Process (MDP) Toolbox: mdp module¶

The mdp module provides classes for the resolution of descrete-time Markov Decision Processes.

## Available classes¶

MDP
Base Markov decision process class
FiniteHorizon
Backwards induction finite horizon MDP
PolicyIteration
Policy iteration MDP
PolicyIterationModified
Modified policy iteration MDP
QLearning
Q-learning MDP
RelativeValueIteration
Relative value iteration MDP
ValueIteration
Value iteration MDP
ValueIterationGS
Gauss-Seidel value iteration MDP
class mdptoolbox.mdp.FiniteHorizon(transitions, reward, discount, N, h=None, skip_check=False)[source]

Bases: mdptoolbox.mdp.MDP

A MDP solved using the finite-horizon backwards induction algorithm.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. discount (float) – Discount factor. See the documentation for the MDP class for details. N (int) – Number of periods. Must be greater than 0. h (array, optional) – Terminal reward. Default: a vector of zeros. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attributes (Data) – ————— – V (array) – Optimal value function. Shape = (S, N+1). V[:, n] = optimal value function at stage n with stage in {0, 1...N-1}. V[:, N] value function for terminal stage. policy (array) – Optimal policy. policy[:, n] = optimal policy at stage n with stage in {0, 1...N}. policy[:, N] = policy for stage N. time (float) – used CPU time

Notes

In verbose mode, displays the current stage and policy transpose.

Examples

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
>>> fh.run()
>>> fh.V
array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
[ 5.9373,  3.24  ,  1.    ,  0.    ],
[ 9.9373,  7.24  ,  4.    ,  0.    ]])
>>> fh.policy
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 0]])

run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.MDP(transitions, reward, discount, epsilon, max_iter, skip_check=False)[source]

Bases: builtins.object

A Markov Decision Problem.

Let S = the number of states, and A = the number of acions.

Parameters: transitions (array) – Transition probability matrices. These can be defined in a variety of ways. The simplest is a numpy array that has the shape (A, S, S), though there are other possibilities. It can be a tuple or list or numpy object array of length A, where each element contains a numpy array or matrix that has the shape (S, S). This “list of matrices” form is useful when the transition matrices are sparse as scipy.sparse.csr_matrix matrices can be used. In summary, each action’s transition matrix must be indexable like transitions[a] where a ∈ {0, 1...A-1}, and transitions[a] returns an S × S array-like object. reward (array) – Reward matrices or vectors. Like the transition matrices, these can also be defined in a variety of ways. Again the simplest is a numpy array that has the shape (S, A), (S,) or (A, S, S). A list of lists can be used, where each inner list has length S and the outer list has length A. A list of numpy arrays is possible where each inner array can be of the shape (S,), (S, 1), (1, S) or (S, S). Also scipy.sparse.csr_matrix can be used instead of numpy arrays. In addition, the outer list can be replaced by any object that can be indexed like reward[a] such as a tuple or numpy object array of length A. discount (float) – Discount factor. The per time-step discount factor on future rewards. Valid values are greater than 0 upto and including 1. If the discount factor is 1, then convergence is cannot be assumed and a warning will be displayed. Subclasses of MDP may pass None in the case where the algorithm does not use a discount factor. epsilon (float) – Stopping criterion. The maximum change in the value function at each iteration is compared against epsilon. Once the change falls below this value, then the value function is considered to have converged to the optimal value function. Subclasses of MDP may pass None in the case where the algorithm does not use an epsilon-optimal stopping criterion. max_iter (int) – Maximum number of iterations. The algorithm will be terminated once this many iterations have elapsed. This must be greater than 0 if specified. Subclasses of MDP may pass None in the case where the algorithm does not use a maximum number of iterations. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check.
P

array

Transition probability matrices.

R

array

Reward vectors.

V

tuple

The optimal value function. Each element is a float corresponding to the expected value of being in that state assuming the optimal policy is followed.

discount

float

The discount rate on future rewards.

max_iter

int

The maximum number of iterations.

policy

tuple

The optimal policy.

time

float

The time used to converge to the optimal policy.

verbose

boolean

Whether verbose output should be displayed or not.

run()[source]

Implemented in child classes as the main algorithm loop. Raises an exception if it has not been overridden.

setSilent()[source]

Turn the verbosity off

setVerbose()[source]

Turn the verbosity on

run()[source]

Raises error because child classes should implement this function.

setSilent()[source]

Set the MDP algorithm to silent mode.

setVerbose()[source]

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.PolicyIteration(transitions, reward, discount, policy0=None, max_iter=1000, eval_type=0, skip_check=False)[source]

Bases: mdptoolbox.mdp.MDP

A discounted MDP solved using the policy iteration algorithm.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. discount (float) – Discount factor. See the documentation for the MDP class for details. policy0 (array, optional) – Starting policy. max_iter (int, optional) – Maximum number of iterations. See the documentation for the MDP class for details. Default is 1000. eval_type (int or string, optional) – Type of function used to evaluate policy. 0 or “matrix” to solve as a set of linear equations. 1 or “iterative” to solve iteratively. Default: 0. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attributes (Data) – ————— – V (tuple) – value function policy (tuple) – optimal policy iter (int) – number of done iterations time (float) – used CPU time

Notes

In verbose mode, at each iteration, displays the number of differents actions between policy n-1 and n

Examples

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.rand(10, 3)
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()

>>> P, R = mdptoolbox.example.forest()
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> expected = (26.244000000000014, 29.484000000000016, 33.484000000000016)
>>> all(expected[k] - pi.V[k] < 1e-12 for k in range(len(expected)))
True
>>> pi.policy
(0, 0, 0)

run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.PolicyIterationModified(transitions, reward, discount, epsilon=0.01, max_iter=10, skip_check=False)[source]

A discounted MDP solved using a modifified policy iteration algorithm.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. discount (float) – Discount factor. See the documentation for the MDP class for details. epsilon (float, optional) – Stopping criterion. See the documentation for the MDP class for details. Default: 0.01. max_iter (int, optional) – Maximum number of iterations. See the documentation for the MDP class for details. Default is 10. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attributes (Data) – ————— – V (tuple) – value function policy (tuple) – optimal policy iter (int) – number of done iterations time (float) – used CPU time

Examples

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)
>>> pim.run()
>>> pim.policy
(0, 0, 0)
>>> expected = (21.81408652334702, 25.054086523347017, 29.054086523347017)
>>> all(expected[k] - pim.V[k] < 1e-12 for k in range(len(expected)))
True

run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.QLearning(transitions, reward, discount, n_iter=10000, skip_check=False)[source]

Bases: mdptoolbox.mdp.MDP

A discounted MDP solved using the Q learning algorithm.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. discount (float) – Discount factor. See the documentation for the MDP class for details. n_iter (int, optional) – Number of iterations to execute. This is ignored unless it is an integer greater than the default value. Defaut: 10,000. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attributes (Data) – ————— – Q (array) – learned Q matrix (SxA) V (tuple) – learned value function (S). policy (tuple) – learned optimal policy (S). mean_discrepancy (array) – Vector of V discrepancy mean over 100 iterations. Then the length of this vector for the default value of N is 100 (N/100). Examples – ——— – # These examples are reproducible only if random seed is set to 0 in (>>>) – # both the random and numpy.random modules. (>>>) – import numpy as np (>>>) – import mdptoolbox, mdptoolbox.example (>>>) – np.random.seed(0) (>>>) – P, R = mdptoolbox.example.forest() (>>>) – ql = mdptoolbox.mdp.QLearning(P, R, 0.96) (>>>) – ql.run() (>>>) – ql.Q (>>>) – 11.198909 , 10.34652034], (array([[) – [ 10.74229967, 11.74105792], [ 2.86980001, 12.25973286]]) expected = (11.198908998901134, 11.741057920409865, 12.259732864170232) (>>>) – all(expected[k] - ql.V[k] < 1e-12 for k in range(len(expected))) (>>>) – True – ql.policy (>>>) – 1, 1) ((0,) – import mdptoolbox (>>>) – import numpy as np – P = np.array([[[0.5, 0.5],[0.8, 0.2]],[[0, 1],[0.1, 0.9]]]) (>>>) – R = np.array([[5, 10], [-1, 2]]) (>>>) – np.random.seed(0) – ql = mdptoolbox.mdp.QLearning(P, R, 0.9) (>>>) – ql.run() – ql.Q – 33.33010866, 40.82109565], (array([[) – [ 34.37431041, 29.67236845]]) expected = (40.82109564847122, 34.37431040682546) (>>>) – all(expected[k] - ql.V[k] < 1e-12 for k in range(len(expected))) – True – ql.policy – 0) ((1,) –
run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.RelativeValueIteration(transitions, reward, epsilon=0.01, max_iter=1000, skip_check=False)[source]

Bases: mdptoolbox.mdp.MDP

A MDP solved using the relative value iteration algorithm.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. epsilon (float, optional) – Stopping criterion. See the documentation for the MDP class for details. Default: 0.01. max_iter (int, optional) – Maximum number of iterations. See the documentation for the MDP class for details. Default: 1000. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attributes (Data) – ————— – policy (tuple) – epsilon-optimal policy average_reward (tuple) – average reward of the optimal policy cpu_time (float) – used CPU time

Notes

In verbose mode, at each iteration, displays the span of U variation and the condition which stopped iterations : epsilon-optimum policy found or maximum number of iterations reached.

Examples

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> rvi = mdptoolbox.mdp.RelativeValueIteration(P, R)
>>> rvi.run()
>>> rvi.average_reward
3.2399999999999993
>>> rvi.policy
(0, 0, 0)
>>> rvi.iter
4

>>> import mdptoolbox
>>> import numpy as np
>>> P = np.array([[[0.5, 0.5],[0.8, 0.2]],[[0, 1],[0.1, 0.9]]])
>>> R = np.array([[5, 10], [-1, 2]])
>>> rvi = mdptoolbox.mdp.RelativeValueIteration(P, R)
>>> rvi.run()
>>> expected = (10.0, 3.885235246411831)
>>> all(expected[k] - rvi.V[k] < 1e-12 for k in range(len(expected)))
True
>>> rvi.average_reward
3.8852352464118312
>>> rvi.policy
(1, 0)
>>> rvi.iter
29

run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.ValueIteration(transitions, reward, discount, epsilon=0.01, max_iter=1000, initial_value=0, skip_check=False)[source]

Bases: mdptoolbox.mdp.MDP

A discounted MDP solved using the value iteration algorithm.

ValueIteration applies the value iteration algorithm to solve a discounted MDP. The algorithm consists of solving Bellman’s equation iteratively. Iteration is stopped when an epsilon-optimal policy is found or after a specified number (max_iter) of iterations. This function uses verbose and silent modes. In verbose mode, the function displays the variation of V (the value function) for each iteration and the condition which stopped the iteration: epsilon-policy found or maximum number of iterations reached.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. discount (float) – Discount factor. See the documentation for the MDP class for details. epsilon (float, optional) – Stopping criterion. See the documentation for the MDP class for details. Default: 0.01. max_iter (int, optional) – Maximum number of iterations. If the value given is greater than a computed bound, a warning informs that the computed bound will be used instead. By default, if discount is not equal to 1, a bound for max_iter is computed, otherwise max_iter = 1000. See the documentation for the MDP class for further details. initial_value (array, optional) – The starting value function. Default: a vector of zeros. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attributes (Data) – ————— – V (tuple) – The optimal value function. policy (tuple) – The optimal policy function. Each element is an integer corresponding to an action which maximises the value function in that state. iter (int) – The number of iterations taken to complete the computation. time (float) – The amount of CPU time used to run the algorithm.
run()[source]

Do the algorithm iteration.

setSilent()

Sets the instance to silent mode.

setVerbose()

Sets the instance to verbose mode.

Notes

In verbose mode, at each iteration, displays the variation of V and the condition which stopped iterations: epsilon-optimum policy found or maximum number of iterations reached.

Examples

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
>>> vi.verbose
False
>>> vi.run()
>>> expected = (5.93215488, 9.38815488, 13.38815488)
>>> all(expected[k] - vi.V[k] < 1e-12 for k in range(len(expected)))
True
>>> vi.policy
(0, 0, 0)
>>> vi.iter
4

>>> import mdptoolbox
>>> import numpy as np
>>> P = np.array([[[0.5, 0.5],[0.8, 0.2]],[[0, 1],[0.1, 0.9]]])
>>> R = np.array([[5, 10], [-1, 2]])
>>> vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
>>> vi.run()
>>> expected = (40.048625392716815, 33.65371175967546)
>>> all(expected[k] - vi.V[k] < 1e-12 for k in range(len(expected)))
True
>>> vi.policy
(1, 0)
>>> vi.iter
26

>>> import mdptoolbox
>>> import numpy as np
>>> from scipy.sparse import csr_matrix as sparse
>>> P = [None] * 2
>>> P = sparse([[0.5, 0.5],[0.8, 0.2]])
>>> P = sparse([[0, 1],[0.1, 0.9]])
>>> R = np.array([[5, 10], [-1, 2]])
>>> vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
>>> vi.run()
>>> expected = (40.048625392716815, 33.65371175967546)
>>> all(expected[k] - vi.V[k] < 1e-12 for k in range(len(expected)))
True
>>> vi.policy
(1, 0)

run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.

class mdptoolbox.mdp.ValueIterationGS(transitions, reward, discount, epsilon=0.01, max_iter=10, initial_value=0, skip_check=False)[source]

A discounted MDP solved using the value iteration Gauss-Seidel algorithm.

Parameters: transitions (array) – Transition probability matrices. See the documentation for the MDP class for details. reward (array) – Reward matrices or vectors. See the documentation for the MDP class for details. discount (float) – Discount factor. See the documentation for the MDP class for details. epsilon (float, optional) – Stopping criterion. See the documentation for the MDP class for details. Default: 0.01. max_iter (int, optional) – Maximum number of iterations. See the documentation for the MDP and ValueIteration classes for details. Default: computed. initial_value (array, optional) – The starting value function. Default: a vector of zeros. skip_check (bool) – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check. Attribues (Data) – ————– – policy (tuple) – epsilon-optimal policy iter (int) – number of done iterations time (float) – used CPU time

Notes

In verbose mode, at each iteration, displays the variation of V and the condition which stopped iterations: epsilon-optimum policy found or maximum number of iterations reached.

Examples

>>> import mdptoolbox.example, numpy as np
>>> P, R = mdptoolbox.example.forest()
>>> vigs = mdptoolbox.mdp.ValueIterationGS(P, R, 0.9)
>>> vigs.run()
>>> expected = (25.5833879767579, 28.830654635546928, 32.83065463554693)
>>> all(expected[k] - vigs.V[k] < 1e-12 for k in range(len(expected)))
True
>>> vigs.policy
(0, 0, 0)

run()[source]
setSilent()

Set the MDP algorithm to silent mode.

setVerbose()

Set the MDP algorithm to verbose mode.