s      a         s'     p(s'|s,a)   r(s,a,s')
high   search    low    1 - α       Rsearch
low    search    high   1 - β       -3
low    search    low    β           Rsearch
high   wait      high   1           Rwait
high   wait      low    0           Rwait
low    wait      high   0           Rwait
low    wait      low    1           Rwait
low    recharge  high   1           0
low    recharge  low    0           0
Transition Graph: state nodes and action nodes

Solution to an MDP = Policy π
- Gives the action to take from a given state, regardless of history
- Two arrays indexed by state:
  - V is the value function, namely the discounted sum of rewards on average from following the policy
  - π is an array of actions to be taken in each state (the policy)
- V(s) = R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')
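To make the two arrays concrete, here is a minimal Python sketch of the recycling-robot MDP tabulated above; the numeric values for α, β, Rsearch, Rwait and the discount γ are illustrative assumptions, not values from the slides. It evaluates a fixed policy by repeatedly applying V(s) = R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s').

```python
# Recycling-robot MDP from the table above. The parameter values below
# (ALPHA, BETA, R_SEARCH, R_WAIT, GAMMA) are illustrative assumptions only.
ALPHA, BETA = 0.8, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0
GAMMA = 0.9

# transitions[(s, a)] = list of (next_state, probability, reward)
transitions = {
    ("high", "search"):   [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
    ("low",  "search"):   [("high", 1 - BETA, -3.0),  ("low", BETA, R_SEARCH)],
    ("high", "wait"):     [("high", 1.0, R_WAIT)],
    ("low",  "wait"):     [("low", 1.0, R_WAIT)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}

def evaluate_policy(policy, sweeps=200):
    """Fixed-policy value: iterate V(s) = sum_s' P(s'|s,pi(s)) * (r + GAMMA*V(s'))."""
    V = {"high": 0.0, "low": 0.0}
    for _ in range(sweeps):
        V = {s: sum(p * (r + GAMMA * V[s2])
                    for s2, p, r in transitions[(s, policy[s])])
             for s in V}
    return V

# One possible policy: always search when the battery is high, recharge when low.
print(evaluate_policy({"high": "search", "low": "recharge"}))
```

For a two-state problem the same fixed point could also be obtained exactly by solving a 2×2 linear system, which is the policy-evaluation route discussed below.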
2 basic steps

Index
- Markov Chains
- Markov Decision Processes: Basic Optimization Techniques, Advanced
- Introduction to MDP
- Conclusion

Optimization Techniques in General Markov Decision Processes
- Value Iteration
- Exhaustive Enumeration
- Policy Iteration
- Linear Programming
- Lagrangian Relaxation

Value Iteration: Finite Horizon Case
- The Markov property allows exploitation of the DP principle for optimal policy construction: no need to enumerate the |A|^{Tn} possible policies
- V_k is the optimal k-stage-to-go value function
- π*(s, k) is the optimal k-stage-to-go policy
- Bellman equation: V_k(s) = max_a [ R(s, a) + γ Σ_{s'} Pr(s, a, s') V_{k-1}(s') ]

Value Iteration: Why So Interesting?
- If the transition probabilities are known, this becomes a straightforward computational problem; however…
- If the transition probabilities are unknown, then this is a problem for reinforcement learning.

Policy Evaluation
- Value equation for a fixed policy: how can we compute the value function for a policy?
- We are given R and Pr
- This is a simple linear system with n variables (each variable is the value of a state) and n constraints (one value equation for each state)
- Use linear algebra (e.g. matrix inverse)

Policy Iteration
- Given a fixed policy, we can compute its value exactly; policy iteration exploits this by iterating steps of policy evaluation and policy improvement:
  1. Choose a random policy π
  2. Loop:
     (a) Evaluate V_π
     (b) For each s in S, set π'(s) = argmax_a [ R(s, a) + γ Σ_{s'} Pr(s, a, s') V_π(s') ]   (policy improvement)
     (c) Replace π with π'
     Until no improving action is possible at any state

Policy Iteration Notes
- Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible
- Convergence assured (Howard); intuitively: no local maxima in value space, and each policy must improve value; since there is a finite number of policies, it will converge to the optimal policy
- Gives the exact value of the optimal policy
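As a hedged illustration of the three procedures just described (finite-horizon value iteration, exact policy evaluation, policy iteration), here is a sketch for a generic tabular MDP. It is not code from the survey; it assumes the model is given as NumPy arrays P[s, a, s'] (transition probabilities) and R[s, a] (expected rewards), names chosen here for convenience.

```python
import numpy as np

def finite_horizon_value_iteration(P, R, T, gamma=1.0):
    """V_k(s): optimal k-stage-to-go values; pi_k(s): optimal k-stage-to-go actions."""
    n = R.shape[0]
    V = np.zeros(n)                        # V_0 = 0: no stages left to go
    policies = []
    for k in range(1, T + 1):
        Q = R + gamma * (P @ V)            # Q_k(s,a) = R(s,a) + gamma*sum_s' P(s,a,s') V_{k-1}(s')
        policies.append(Q.argmax(axis=1))  # pi*(s, k)
        V = Q.max(axis=1)                  # V_k(s)
    return V, policies

def evaluate(P, R, policy, gamma):
    """Exact policy evaluation: solve the n-variable linear system (I - gamma*P_pi) V = R_pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]         # P_pi[s, s'] = Pr(s' | s, policy[s])
    R_pi = R[np.arange(n), policy]         # one-step reward under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def policy_iteration(P, R, gamma=0.95):
    """Alternate exact policy evaluation with greedy policy improvement (Howard)."""
    n = R.shape[0]
    policy = np.zeros(n, dtype=int)        # start from an arbitrary policy
    while True:
        V = evaluate(P, R, policy, gamma)
        Q = R + gamma * (P @ V)            # Q(s,a) under the current value estimate
        improved = Q.argmax(axis=1)        # policy improvement step
        if np.array_equal(improved, policy):
            return policy, V               # no improving action at any state
        policy = improved
```

Policy evaluation is done here with a direct linear solve, matching the "n variables, n constraints" observation above; for large state spaces an iterative solver or approximate evaluation would normally replace np.linalg.solve.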
Value Iteration vs. Policy Iteration
- Which is faster, VI or PI? It depends on the problem.
- VI takes more iterations than PI, but PI requires more time on each iteration: PI must perform policy evaluation on each step, which involves solving a linear system.
- Complexity: there are at most exp(n) policies, so PI is no worse than exponential time in the number of states.
- Empirically, O(n) iterations are required.
- Still no polynomial bound on the number of PI iterations (open problem)!

Curse of Dimensionality
- Example: reservoir network scheduling
- The state space (number of points) grows exponentially with the dimension of the state vector

Solutions
- Approximate dynamic programming
- Perturbation analysis
- Action aggregation
- State aggregation
- …

Approximate Dynamic Programming
- Heuristic cost-to-go approximation (a minimal code sketch follows at the end of this section)

Perturbation Analysis
- Extract the gradient information from one simulation

Conclusion?
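The slides only name heuristic cost-to-go approximation, so the following is a purely illustrative sketch rather than the survey's method: represent V(s) by a small feature vector and fit the weights by least squares against one-step Bellman backups computed only at a sample of states. The function name fitted_value_iteration and the callbacks features and bellman_backup are assumptions made for this sketch.

```python
import numpy as np

def fitted_value_iteration(sample_states, features, bellman_backup, n_iters=50):
    """Heuristic cost-to-go approximation: V(s) ~ features(s) . w.

    features(s)          -> 1-D feature vector for state s                (assumed callback)
    bellman_backup(s, V) -> max_a [ R(s,a) + gamma * E[V(s')] ] for s     (assumed callback)
    Only the sampled states are ever backed up, so the work per iteration
    does not grow with the full (exponentially large) state space.
    """
    Phi = np.array([features(s) for s in sample_states])     # design matrix
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        def V_hat(s):
            return float(features(s) @ w)                    # current approximate cost-to-go
        targets = np.array([bellman_backup(s, V_hat) for s in sample_states])
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)    # least-squares projection
    return w
```

With a reservoir-style state vector, for example, sample_states could be a modest random sample of storage-level combinations and features could be low-order polynomials of the individual levels.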