
• Problem 1:

Given the observation sequence $O = \{O_1, O_2, \ldots, O_T\}$ and an HMM, how do we efficiently compute the probability of the observation sequence?

• Problem 2:

Given the observation sequence $O = \{O_1, O_2, \ldots, O_T\}$ and an HMM, how do we choose a corresponding state sequence $Q = \{Q_1, Q_2, \ldots, Q_T\}$ which is optimal in a certain sense?

• Problem 3:

Given the observation sequence $O = \{O_1, O_2, \ldots, O_T\}$, how do we choose the model parameters $\lambda$ of an HMM?

For Problem 1, a forward-backward dynamic programming procedure [11] has been formulated to calculate the probability of the observation sequence efficiently.
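To make the procedure concrete, a minimal sketch of the forward pass in Python is given below. The HMM is assumed to be represented by an initial distribution `pi`, a transition matrix `A` with `A[i][j]` the probability of moving from state `i` to state `j`, and an emission matrix `B` with `B[i][o]` the probability of observing symbol `o` in state `i`; these names and conventions are illustrative and not taken from [11] or [175].

```python
def forward_probability(obs, pi, A, B):
    """Compute P(O) for an HMM by the forward procedure.

    obs : observation sequence (list of symbols indexing the columns of B)
    pi  : initial state distribution, pi[i] = P(Q_1 = i)
    A   : transition matrix, A[i][j] = P(Q_{t+1} = j | Q_t = i)
    B   : emission matrix, B[i][o] = P(O_t = o | Q_t = i)
    """
    n_states = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_{ij}) * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
                 for j in range(n_states)]
    # Termination: P(O) = sum_i alpha_T(i)
    return sum(alpha)
```

The cost is on the order of $N^2 T$ operations for $N$ states and $T$ observations, as opposed to the exponential cost of summing over all $N^T$ possible state sequences.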

For Problem 2, we attempt to uncover the hidden part of the model, i.e., to find the “correct” state sequence. In many practical situations, we use an optimality criterion to solve the problem as well as possible. The most widely used criterion is to find the single best state sequence, i.e., to maximize the likelihood $P(Q \mid \lambda, O)$. This is equivalent to maximizing $P(Q, O \mid \lambda)$ since

$$P(Q \mid \lambda, O) = \frac{P(Q, O \mid \lambda)}{P(O \mid \lambda)}.$$

The Viterbi algorithm [204] is a dynamic programming technique for finding this single best state sequence $Q = \{Q_1, Q_2, \ldots, Q_T\}$ for the given observation sequence $O = \{O_1, O_2, \ldots, O_T\}$.
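A corresponding sketch of the Viterbi recursion is given below, under the same illustrative HMM representation (`pi`, `A`, `B`) as in the sketch above; it illustrates the dynamic programming idea rather than reproducing the algorithm of [204] verbatim.

```python
def viterbi(obs, pi, A, B):
    """Return a single best state sequence Q maximizing P(Q, O).

    pi[i], A[i][j] and B[i][o] are as in the forward-procedure sketch above.
    """
    n_states = len(pi)
    # delta[i]: highest probability of any state path ending in state i at time t
    delta = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # psi[t][j]: best predecessor of state j at time t + 1 (for backtracking)
    psi = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        psi.append(step)
        delta = new_delta
    # Backtrack from the most probable final state
    q = [max(range(n_states), key=lambda i: delta[i])]
    for step in reversed(psi):
        q.append(step[q[-1]])
    return list(reversed(q))
```

In practice the recursion is usually carried out with log-probabilities to avoid numerical underflow on long observation sequences.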

For Problem 3, we attempt to adjust the model parameters $\lambda$ such that $P(O \mid \lambda)$ is maximized, by using the Expectation-Maximization (EM) algorithm. For a complete tutorial on HMMs, we refer readers to the paper by Rabiner [175] and the book by MacDonald and Zucchini [155].

1.5 Markov Decision Process

The Markov Decision Process (MDP) has been successfully applied in equipment maintenance, inventory control and many other areas of management science [3, 208]. In this section we briefly introduce the MDP; interested readers may also consult the books by Altman [3], Puterman [173] and White [207].

Similar to the case of a Markov chain, the MDP is a system that can move from one distinguished state to any other possible state. At each step, the decision maker has to choose an action from a well-defined set of alternatives. This action affects the transition probabilities of the next move and incurs an immediate gain (or loss) as well as subsequent gains (or losses). The problem the decision maker faces is to determine a sequence of actions maximizing the overall gain. The process of an MDP is summarized as follows:

(i) At time $t$, a certain state $i$ of the Markov chain is observed.

(ii) After the observation of the state, an action, let us say $k$, is taken from a set of possible decisions $A_i$. Different states may have different sets of possible actions.

(iii) An immediate gain (or loss) $q_i^{(k)}$ is then incurred according to the current state $i$ and the action $k$ taken.

(iv) The transition probabilities $p_{ji}^{(k)}$ of moving from state $i$ to state $j$ are then affected by the action $k$.

(v) When the time parameter $t$ increases, a transition occurs again and the above steps (i)–(iv) repeat.
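The five steps above can be captured in a few lines of Python. The nested-list layout used here, with `q[i][k]` the immediate gain and `P[i][k][j]` the probability of moving from state `i` to state `j` under alternative `k`, is an illustrative choice of ours and is reused in the later sketches of this section.

```python
import random

def mdp_step(i, k, q, P):
    """One round of steps (i)-(iv): in state i, take action k, collect the
    immediate gain q_i(k) and move to a random next state drawn according
    to the transition probabilities associated with action k.

    q[i][k]    : immediate gain (or loss) in state i under alternative k
    P[i][k][j] : probability of moving from state i to state j under alternative k
    """
    gain = q[i][k]
    next_state = random.choices(range(len(P[i][k])), weights=P[i][k])[0]
    return gain, next_state
```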

A policy $D$ is a rule for taking actions. It prescribes all the decisions that should be made throughout the process. Given the current state $i$, the value of an optimal policy $v_i(t)$ is defined as the total expected gain obtained with $t$ decisions (or transitions) remaining. For the case of one period remaining, i.e. $t = 1$, the value of an optimal policy is given by

$$v_i(1) = \max_{k \in A_i} \bigl\{ q_i^{(k)} \bigr\}. \qquad (1.10)$$

Since there is only one period remaining, an action maximizing the immediate gain will be taken. For the case of two periods remaining, we have

$$v_i(2) = \max_{k \in A_i} \Bigl\{ q_i^{(k)} + \alpha \sum_j p_{ji}^{(k)} v_j(1) \Bigr\}, \qquad (1.11)$$

where $\alpha$ is the discount factor. Since the subsequent gain is associated with the transition probabilities, which are affected by the action taken, an optimal policy should take into account both the immediate and the subsequent gain. The model can easily be extended to the more general situation of a process with $n$ transitions remaining:

$$v_i(n) = \max_{k \in A_i} \Bigl\{ q_i^{(k)} + \alpha \sum_j p_{ji}^{(k)} v_j(n-1) \Bigr\}. \qquad (1.12)$$


Table 1.1 A summary of the policy parameters

State $i$          Alternative $k$            $q_i^{(k)}$   $p_{1i}^{(k)}$   $p_{2i}^{(k)}$
1 (high volume)    1. (No action)             8             0.4              0.6
                   2. (Regular Maintenance)   7             0.8              0.2
                   3. (Fully Upgrade)         5             1                0
2 (low volume)     1. (No action)             4             0.1              0.9
                   2. (Regular Maintenance)   3             0.4              0.6
                   3. (Fully Upgrade)         1             0.8              0.2

From the above equation, the subsequent gain of $v_i(n)$ is defined as the expected value of $v_j(n-1)$. Since the number of transitions remaining is finite, the process is called a discounted finite horizon MDP. For the infinite horizon MDP, the value of an optimal policy can be expressed as

$$v_i = \max_{k \in A_i} \Bigl\{ q_i^{(k)} + \alpha \sum_j p_{ji}^{(k)} v_j \Bigr\}. \qquad (1.13)$$

The finite horizon MDP is a dynamic programming problem, and the infinite horizon MDP can be transformed into a linear programming problem. Both can be solved easily using an Excel spreadsheet.
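One standard way of carrying out this transformation, shown as a hedged sketch below using the nested-list layout introduced earlier and the data of Table 1.1, is to minimize $\sum_i v_i$ subject to the constraints $v_i \ge q_i^{(k)} + \alpha \sum_j p_{ji}^{(k)} v_j$ for every state $i$ and every alternative $k \in A_i$; the optimal solution of this linear program is the optimal value vector of (1.13). The function and variable names are our own, and any LP solver (including the Excel solver mentioned above) could be used in place of scipy.

```python
from scipy.optimize import linprog

def optimal_values_lp(q, P, alpha):
    """Optimal values of a discounted infinite horizon MDP via linear programming.

    Solves:  min sum_i v_i  subject to  v_i >= q[i][k] + alpha * sum_j P[i][k][j] * v_j
    for every state i and every alternative k.
    """
    n = len(q)
    c = [1.0] * n                              # minimize v_1 + ... + v_n
    A_ub, b_ub = [], []
    for i in range(n):
        for k in range(len(q[i])):
            # v_i - alpha * sum_j P[i][k][j] * v_j >= q[i][k],
            # rewritten for linprog as  -(v_i - alpha * sum_j ...) <= -q[i][k]
            row = [alpha * P[i][k][j] for j in range(n)]
            row[i] -= 1.0
            A_ub.append(row)
            b_ub.append(-q[i][k])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
    return list(res.x)

# Data of Table 1.1 (state 1 = high volume, state 2 = low volume)
q = [[8, 7, 5], [4, 3, 1]]
P = [[[0.4, 0.6], [0.8, 0.2], [1.0, 0.0]],
     [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]]
print(optimal_values_lp(q, P, alpha=0.9))      # approximately [59.2, 53.2], cf. Example 1.57
```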

Example 1.56. We consider an on-line game company that plans to stay in business for 4 more years and then it will be closed without any salvage value. Each year, the volume of players only depends on the volume in the last year, and it is classified as either high or low. If a high volume of players occurs, the expected profit for the company will be 8 million dollars; but the profit drops to 4 million dollars when a low volume of players is encountered. At the end of every year, the profit of that year is collected, and then the company has the option to take certain actions that influence the performance of its service, and hence the volume of players in the future may be altered. However, some of these actions are costly, so they reduce the instant profit. To be more specific, the company can choose to: take no action, which costs nothing; perform only regular maintenance of the service system, which costs 1 million; or fully upgrade the service system, which costs 3 million. When the volume of players in the last year was high, it stays in the high state in the coming year with probability 0.4 if no action is taken; this probability is 0.8 if only regular maintenance is performed; and the probability rises to 1 if the system is fully upgraded. When the volume of players in the last year was low, the probability that the player volume stays low is 0.9 with no action taken, 0.6 with regular maintenance, and 0.2 when the service system is fully upgraded. Assume that the discount factor is 0.9 and that the company experienced a low volume of players last year. Determine the optimal (profit maximizing) strategy for the company. The parameters of this problem are summarized in Table 1.1.

By using the MDP approach, we can compute the values for the last period from (1.10):
$$v_1(1) = \max\{8, 7, 5\} = 8 \quad \text{and} \quad v_2(1) = \max\{4, 3, 1\} = 4.$$
With these results, we can solve for the other values by backward substitution using (1.12). If we let $p_i(n)$ denote the alternative attaining the maximum in the computation of $v_i(n)$, then $p_i(n)$ actually keeps track of the optimal policy for every single period. We can summarize all the results from the calculations in Table 1.2.


Table 1.2 A summary of results (discount factor 0.9)

$n$         1    2       3       4
$v_1(n)$    8    13.48   18.15   22.27
$v_2(n)$    4    8.04    12.19   16.27
$p_1(n)$    1    2       2       2
$p_2(n)$    1    2       2       3

Table 1.3 A summary of results (discount factor 0.6)

$n$         1    2       3       4
$v_1(n)$    8    11.36   13.25   14.35
$v_2(n)$    4    6.64    8.27    9.26
$p_1(n)$    1    1       2       2
$p_2(n)$    1    1       1       1

Since the on-line game company started with a low volume of players (State 2), the optimal policy for the company is as follows: with 4 more years left, choose Alternative 3 (fully upgrade); then use Alternative 2 (regular maintenance) for two consecutive years; and finally, use Alternative 1 (no action) when there is only 1 year left.

Note that the optimal policy may vary depending on the value of the discount factor. For instance, if in this example we have a discount factor of 0.6, then we obtain the different results summarized in Table 1.3. If the company starts with a low volume of players, the optimal policy is to stay with Alternative 1 (no action). We leave it as an exercise for the reader to derive these results.
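For readers who wish to check such tables programmatically, a minimal backward-induction sketch is given below, again using our illustrative `q`/`P` layout for the data of Table 1.1; running it with discount factors 0.9 and 0.6 reproduces, up to rounding, the values and policies of Tables 1.2 and 1.3.

```python
def finite_horizon_mdp(q, P, alpha, horizon):
    """Backward induction for the discounted finite horizon MDP.

    Returns values[n-1][i] = v_i(n) and policies[n-1][i] = p_i(n)
    for n = 1, ..., horizon, with the alternatives numbered from 1
    as in Table 1.1.
    """
    n_states = len(q)
    values, policies = [], []
    v_prev = [0.0] * n_states                  # no gain once the process has ended
    for _ in range(horizon):
        v, p = [], []
        for i in range(n_states):
            # total expected gain of each alternative: immediate gain plus
            # discounted expected value of the remaining periods
            gains = [q[i][k] + alpha * sum(P[i][k][j] * v_prev[j]
                                           for j in range(n_states))
                     for k in range(len(q[i]))]
            best = gains.index(max(gains))
            v.append(gains[best])
            p.append(best + 1)
        values.append(v)
        policies.append(p)
        v_prev = v
    return values, policies

# Data of Table 1.1 (state 1 = high volume, state 2 = low volume)
q = [[8, 7, 5], [4, 3, 1]]
P = [[[0.4, 0.6], [0.8, 0.2], [1.0, 0.0]],
     [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]]

values_09, policies_09 = finite_horizon_mdp(q, P, alpha=0.9, horizon=4)   # Table 1.2
values_06, policies_06 = finite_horizon_mdp(q, P, alpha=0.6, horizon=4)   # Table 1.3
```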

1.5.1 Stationary Policy

A stationary policy is a policy whose decision depends only on the state the system is in and is independent of $n$. For instance, a stationary policy $D$ prescribes the action $D(i)$ when the current state is $i$. Define $\bar{D}$ as the associated one-step-removed policy; then the value of the policy, $w_i(D)$, is defined as

$$w_i(D) = q_i^{D(i)} + \alpha \sum_j p_{ji}^{D(i)} w_j(\bar{D}). \qquad (1.15)$$

Given a Markov decision process with an infinite horizon and a discount factor $\alpha$, $0 < \alpha < 1$, choose, for each $i$, an alternative $k_i$ such that

$$\max_{k \in A_i} \Bigl\{ q_i^{(k)} + \alpha \sum_j p_{ji}^{(k)} v_j \Bigr\} = q_i^{(k_i)} + \alpha \sum_j p_{ji}^{(k_i)} v_j.$$

Define the stationary policy $D$ by $D(i) = k_i$. Then for each $i$, $w_i(D) = v_i$, i.e., the stationary policy is an optimal policy.

Table 1.4 A summary of results

Policy $D$            Values
$D(1)$    $D(2)$      $w_1(D)$    $w_2(D)$
1         1           50.41       44.93
1         2           53.00       48.00
1         3           52.21       47.06
2         1           55.41       47.30
2         2           58.75       52.50
2         3           59.20       53.20
3         1           50.00       44.74
3         2           50.00       45.65
3         3           50.00       45.12

Example 1.57. Determine the optimal policy and the values for the Markov decision process in Example 1.56, assuming that the process has an infinite horizon and that the discount factor remains equal to 0.9.

For a stationary policy $D$ (with $D(1) = k_1$ and $D(2) = k_2$), equation (1.15) with $\bar{D} = D$ gives
$$\begin{cases} w_1(D) = q_1^{(k_1)} + \alpha \bigl[ p_{11}^{(k_1)} w_1(D) + p_{21}^{(k_1)} w_2(D) \bigr] \\ w_2(D) = q_2^{(k_2)} + \alpha \bigl[ p_{12}^{(k_2)} w_1(D) + p_{22}^{(k_2)} w_2(D) \bigr], \end{cases}$$
hence $[w_1(D),\, w_2(D)]$ can be solved for every stationary policy. The results are summarized in the following table.

From the above, the optimal values are $v_1 = w_1(D) = 59.2$ and $v_2 = w_2(D) = 53.2$; the optimal stationary policy is to choose Alternative 2 in state 1 and Alternative 3 in state 2 (Table 1.4).
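Table 1.4 itself can be reproduced by enumerating the nine stationary policies and solving the corresponding $2 \times 2$ linear system for each. The sketch below does this with Cramer's rule, using the same illustrative data layout as in the earlier sketches.

```python
from itertools import product

def stationary_policy_values(q, P, alpha, policy):
    """Solve w_i(D) = q_i^{D(i)} + alpha * sum_j p_{ji}^{D(i)} w_j(D) for a
    two-state stationary policy D = (k1, k2), alternatives numbered from 1."""
    k1, k2 = policy[0] - 1, policy[1] - 1
    # Linear system (I - alpha * P_D) w = q_D, written out for two states
    a11 = 1.0 - alpha * P[0][k1][0]
    a12 = -alpha * P[0][k1][1]
    a21 = -alpha * P[1][k2][0]
    a22 = 1.0 - alpha * P[1][k2][1]
    b1, b2 = q[0][k1], q[1][k2]
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det

# Data of Table 1.1 (state 1 = high volume, state 2 = low volume)
q = [[8, 7, 5], [4, 3, 1]]
P = [[[0.4, 0.6], [0.8, 0.2], [1.0, 0.0]],
     [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]]

for d1, d2 in product([1, 2, 3], repeat=2):    # the nine stationary policies
    w1, w2 = stationary_policy_values(q, P, alpha=0.9, policy=(d1, d2))
    print(d1, d2, round(w1, 2), round(w2, 2))  # rows of Table 1.4
```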