Random Walk Example
- Each step Z_i equals +1 with probability p and -1 with probability 1 - p.
- At time t_i, the state is X_i = X_0 + Z_1 + ... + Z_i.
- Illustration: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
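Below is a minimal sketch (not from the original slides) of this random walk: the steps Z_i are +/-1, and the step probability p = 0.5 is only an illustrative default.

```python
import random

def simulate_random_walk(n_steps, p=0.5, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with Z_i = +1 (prob p) or -1 (prob 1 - p)."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        z = 1 if random.random() < p else -1  # one step Z_i
        x += z
        path.append(x)
    return path

# Example: a 10-step symmetric random walk starting at X_0 = 0
print(simulate_random_walk(10))
```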
Markov Property
- Also thought of as the "memoryless" property.
- A stochastic process has the Markov property if the probability of state X_{n+1} taking any given value depends only upon state X_n.
- Whether this holds depends very much on how the states are described.

Markov Chains - Markov Chain Definition
A stochastic process {X_t, t = 0, 1, 2, ...} is a finite-state Markov chain if it has:
- a finite number of states,
- the Markov property,
- stationary transition probabilities p_ij, and
- a set of initial probabilities P(X_0 = i) for all states i.
In short, it is a discrete-time stochastic process with the Markov property.

Markov Chains - Simple Example
Weather:
- Raining today: 40% rain tomorrow, 60% no rain tomorrow.
- Not raining today: 20% rain tomorrow, 80% no rain tomorrow.
The same model can be drawn as a stochastic FSM whose arcs carry these probabilities.

Markov Chains - Simple Example
Stochastic matrix: each row sums to 1. For the weather model, the transition matrix is

          rain   no rain
rain      0.4    0.6
no rain   0.2    0.8
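As a quick illustration (assumed code, not from the deck), the weather chain's transition matrix can be written down directly; the snippet checks that each row sums to 1 and propagates the distribution a few days forward starting from "raining today".

```python
import numpy as np

# Transition matrix of the weather chain: state 0 = rain, state 1 = no rain
P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

assert np.allclose(P.sum(axis=1), 1.0)  # stochastic matrix: each row sums to 1

pi0 = np.array([1.0, 0.0])  # it is raining today
for n in range(1, 4):
    print(f"day {n}:", pi0 @ np.linalg.matrix_power(P, n))  # distribution after n days
```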
An industrial example: Google PageRank.

Index
- Markov Chains
- Markov Decision Processes
- Basic Optimization Techniques
- Advanced Introduction to MDP
- Conclusion

Markov Decision Process (MDP)
- A discrete-time stochastic control process; an extension of Markov chains.
- Differences: the addition of actions (choice) and the addition of rewards (motivation).
- If the actions are fixed, an MDP reduces to a Markov chain.

MDP Framework (1/2)
- S: state space
- A: action space
- Pr(s_{t+1} = s' | s_t, a_t) = Pr(s_{t+1} = s' | s_0, ..., s_t, a_0, ..., a_t)  [Markov property]
- R(s): immediate reward at state s
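As a rough sketch of how these ingredients look in code (the state and action names below are invented for illustration and are not part of the slides), a finite MDP can be stored as explicit tables:

```python
# Hypothetical two-state, two-action MDP: transitions P[s][a][s'] and rewards R[s]
S = ["s0", "s1"]
A = ["a0", "a1"]
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s0": 0.2, "s1": 0.8}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}
R = {"s0": 0.0, "s1": 1.0}

# Sanity check: for every (s, a), the next-state probabilities form a distribution
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```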
MDP Framework (2/2)
Find a policy π: S → A that maximizes expected reward. Possible objectives:
- Myopic: E[r_t | π, s_t] for all s.
- Finite horizon: E[Σ_{t=0}^{k} r_t | π, s_0]; the optimal policy may be non-stationary (it depends on time).
- Infinite horizon: E[Σ_{t=0}^{∞} r_t | π, s_0], or the discounted form E[Σ_{t=0}^{∞} γ^t r_t | π, s_0], where 0 < γ < 1 is the discount factor; here the optimal policy is stationary.
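To make the discounted objective concrete, here is a small assumed example (not from the slides) that sums γ^t r_t over a finite reward trajectory:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma**t * r_t for a finite trajectory, with 0 < gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A constant reward of 1 per step approaches 1 / (1 - gamma) = 10 as the horizon grows
print(discounted_return([1.0] * 100, gamma=0.9))
```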
Value of a Policy
- How good is a policy π? How do we measure "accumulated" reward?
- A value function V: S → ℝ associates a value with each state (or with each state and time, for a non-stationary π).
- V^π(s) denotes the value of policy π at state s. It depends on the immediate reward, but also on what is achieved subsequently by following π.
- An optimal policy is one that is no worse than any other policy at any state.
- The goal of MDP planning is to compute an optimal policy (the method depends on how value is defined).

Simple MDP Example: Recycling Robot
- The robot can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery.
- It has two energy levels, high and low.
- Searching runs down the battery, waiting does not, and a depleted battery has a very low reward.

Transition Probabilities
s = s_t   s' = s_{t+1}   a = a_t   P^a_{ss'}   R^a_{ss'}
high      high           search    α           R_search
high      low            search    1 - α       R_search
Since searching from the high state leaves the battery either high or low, the two probabilities for (high, search) must sum to 1, which gives the 1 - α entry.
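To tie the value-function definition to the recycling robot, here is a rough policy-evaluation sketch; the numbers for α, the rewards, and γ are assumptions chosen purely for illustration (the slides do not specify them), and the fixed policy evaluated is an arbitrary choice.

```python
# Assumed parameters for illustration only (not taken from the slides)
ALPHA = 0.8                      # P(battery stays high | high, search)
R_SEARCH, R_WAIT = 2.0, 1.0      # assumed rewards for searching / waiting
GAMMA = 0.9                      # discount factor

# Fixed policy pi: search while the battery is high, wait while it is low
pi = {"high": "search", "low": "wait"}
P = {("high", "search"): {"high": ALPHA, "low": 1 - ALPHA},
     ("low", "wait"):    {"low": 1.0}}
R = {("high", "search"): R_SEARCH, ("low", "wait"): R_WAIT}

# Iterative policy evaluation:
#   V(s) <- R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')
V = {"high": 0.0, "low": 0.0}
for _ in range(200):
    V = {s: R[(s, pi[s])] + GAMMA * sum(p * V[s2] for s2, p in P[(s, pi[s])].items())
         for s in V}
print(V)  # approximate V^pi under the assumed numbers
```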