Optimal Control 2017: Introduction to MDP
高峰、吴江, 2017/05/23

Reinforcement Learning

Inverted Pendulum
In general, the inverted pendulum (cart-pole) system has four state variables:
- x: position of the cart on the track
- θ: angle of the pendulum from the vertical
- v: velocity of the cart
- ω: angular velocity of the pendulum

Inverted Pendulum: Defining the Environment
- State s
- Action a: {left, right}
- Reward r: {0, -1}

Q-Learning
- Randomly set the initial state
- Repeat simulation runs to estimate the Q value of each state-action pair
- When a run exceeds the prescribed number of steps or the failure limit, reset the initial state
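To make the Q-learning loop on this slide concrete, here is a minimal Python sketch of tabular Q-learning on a crude cart-pole simulator. The dynamics constants, the coarse state binning, and the learning parameters are illustrative assumptions, not values taken from the slides.

```python
import math
import random
from collections import defaultdict

# Crude cart-pole dynamics (Euler integration). All physical constants and the
# failure thresholds are assumptions made for this sketch, not slide values.
GRAVITY, MASS_CART, MASS_POLE, POLE_LEN, FORCE, DT = 9.8, 1.0, 0.1, 0.5, 10.0, 0.02

def step(state, action):
    """Advance one time step; action 0 pushes the cart left, 1 pushes it right."""
    x, v, theta, omega = state
    force = FORCE if action == 1 else -FORCE
    total_mass = MASS_CART + MASS_POLE
    temp = (force + MASS_POLE * POLE_LEN * omega ** 2 * math.sin(theta)) / total_mass
    theta_acc = (GRAVITY * math.sin(theta) - math.cos(theta) * temp) / (
        POLE_LEN * (4.0 / 3.0 - MASS_POLE * math.cos(theta) ** 2 / total_mass))
    x_acc = temp - MASS_POLE * POLE_LEN * theta_acc * math.cos(theta) / total_mass
    x, v = x + DT * v, v + DT * x_acc
    theta, omega = theta + DT * omega, omega + DT * theta_acc
    failed = abs(x) > 2.4 or abs(theta) > 12 * math.pi / 180
    return (x, v, theta, omega), (-1.0 if failed else 0.0), failed

def discretize(state):
    """Very coarse binning of (x, v, theta, omega) into a small discrete state."""
    return tuple(int(value > 0) for value in state)

ALPHA, GAMMA, EPSILON, MAX_STEPS = 0.1, 0.99, 0.1, 500
Q = defaultdict(float)                        # Q[(state, action)], all zeros initially

for episode in range(2000):
    # Random initial state, as on the slide.
    state = tuple(random.uniform(-0.05, 0.05) for _ in range(4))
    s = discretize(state)
    for _ in range(MAX_STEPS):                # step limit: reset when exceeded
        # epsilon-greedy choice between the two actions {left, right}
        a = random.choice([0, 1]) if random.random() < EPSILON \
            else max([0, 1], key=lambda act: Q[(s, act)])
        state, r, failed = step(state, a)
        s_next = discretize(state)
        # Q-learning update for the state-action pair just visited.
        best_next = max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next
        if failed:                            # failure: start a new episode
            break
```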

Index
- Markov Chains
- Markov Decision Processes
- Basic Optimization Techniques
- Advanced Introduction to MDP
- Conclusion

Stochastic Process
- Quick definition: a random process
- Often viewed as a collection of indexed random variables
- Useful to us: a set of states, with probabilities of being in those states, indexed over time
- We'll deal with discrete stochastic processes

Stochastic Process Example
- Classic example: the random walk
- Start at state X_0 at time t_0
- At time t_i, take a step Z_i, where P(Z_i = -1) = p and P(Z_i = 1) = 1 - p
- At time t_i, the state is X_i = X_0 + Z_1 + … + Z_i
- http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
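A short simulation of this random walk may help; the step probability p, the walk length, and the seed below are arbitrary illustration choices.

```python
import random

def random_walk(p, n_steps, x0=0, seed=None):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z_i = -1) = p, P(Z_i = +1) = 1 - p."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        step = -1 if rng.random() < p else 1
        path.append(path[-1] + step)
    return path

# Example: a symmetric walk (p = 0.5) of 10 steps starting at 0.
print(random_walk(p=0.5, n_steps=10, seed=42))
```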

Markov Property
- Also thought of as the "memoryless" property
- A stochastic process is said to have the Markov property if the probability that X_{n+1} takes any given value depends only on the state X_n
- Whether it holds depends very much on how the states are described

Markov Chains: Definition
A stochastic process {X_t, t = 0, 1, 2, …} is a finite-state Markov chain if it has the following properties:
- a finite number of states
- the Markov property
- stationary transition probabilities p_ij
- a set of initial probabilities P(X_0 = i) for all states i
In short: a discrete-time stochastic process with the Markov property.

Markov Chains: Simple Example
Weather:
- raining today: 40% rain tomorrow, 60% no rain tomorrow
- not raining today: 20% rain tomorrow, 80% no rain tomorrow

Markov Chains: Simple Example (Stochastic FSM)
The same weather model drawn as a stochastic finite-state machine, with the probabilities above labelling the transitions.

Markov Chains: Simple Example (Stochastic Matrix)
The transition matrix, whose rows sum up to 1 (rows: today's weather, columns: tomorrow's, ordered [rain, no rain]):

    P = [ 0.4  0.6 ]
        [ 0.2  0.8 ]

An industrial example: Google PageRank.
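The weather chain above gives a concrete stochastic matrix; the sketch below propagates a distribution through it by repeated multiplication, which is the same power-iteration idea that PageRank applies to the web-link matrix. The 20-step horizon is an arbitrary choice.

```python
# Transition matrix for the weather example: row i gives the distribution of
# tomorrow's weather given today's weather (order: [rain, no rain]).
P = [[0.4, 0.6],
     [0.2, 0.8]]

def propagate(dist, P):
    """One step of the chain: multiply a row distribution by the transition matrix."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Start from "raining today" and iterate; the distribution converges to the
# stationary distribution of the chain.
dist = [1.0, 0.0]
for day in range(20):
    dist = propagate(dist, P)
print(dist)   # approaches [0.25, 0.75]
```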

Markov Decision Process (MDP)
- A discrete-time stochastic control process
- An extension of Markov chains; the differences are:
  - the addition of actions (choice)
  - the addition of rewards (motivation)
- If the actions are fixed, an MDP reduces to a Markov chain

MDP Framework (1/2)
- S: state space
- A: action space
- Pr(s_{t+1} = s' | s_t, a_t) = Pr(s_{t+1} = s' | s_0, …, s_t, a_0, …, a_t)   [Markov property]
- R(s): immediate reward at state s

MDP Framework (2/2)
Find a policy π: S → A that maximizes:
- Myopic: E[r_t | π, s_t] for all s
- Finite horizon: E[Σ_{t=0}^{k} r_t | π, s_0]
  - the optimal policy is non-stationary: it depends on time
- Infinite horizon: E[Σ_{t=0}^{∞} r_t | π, s_0], or with discounting E[Σ_{t=0}^{∞} γ^t r_t | π, s_0]
  - 0 < γ < 1 is the discount factor
  - the optimal policy is stationary
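As a small worked illustration of these objectives, the snippet below computes a finite-horizon return and a discounted return for a fixed reward sequence; the rewards and γ = 0.9 are made-up example values, not slide data.

```python
# Illustrative reward sequence r_0, r_1, ... received under some fixed policy.
rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
gamma = 0.9

finite_horizon = sum(rewards[:4])                                  # sum_{t=0}^{k} r_t with k = 3
discounted = sum(gamma ** t * r for t, r in enumerate(rewards))    # sum_t gamma^t r_t

print(finite_horizon)   # 1.0
print(discounted)       # 0.9^2 + 0.9^4 + 0.9^5 ≈ 2.06
```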

Value of a Policy
- How good is a policy π? How do we measure "accumulated" reward?
- A value function V: S → ℝ associates a value with each state (or with each state and time, for a non-stationary π)
- V^π(s) denotes the value of policy π at state s
  - it depends on the immediate reward, but also on what you achieve subsequently by following π
- An optimal policy is one that is no worse than any other policy at any state
- The goal of MDP planning is to compute an optimal policy (the method depends on how we define value)
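A minimal sketch of estimating V^π by repeated Bellman backups, assuming a toy two-state MDP invented purely for illustration (it is not the recycling robot of the next slide); it uses the R(s) reward convention from the framework slides.

```python
# Iterative policy evaluation of V^pi on a toy two-state MDP.
# States, transition probabilities, rewards, and the policy are all assumptions.
states = ["s1", "s2"]
gamma = 0.9

# P[(s, a)] is a list of (next_state, probability); R[s] is the immediate reward.
P = {("s1", "a"): [("s1", 0.5), ("s2", 0.5)],
     ("s1", "b"): [("s2", 1.0)],
     ("s2", "a"): [("s1", 1.0)],
     ("s2", "b"): [("s2", 1.0)]}
R = {"s1": 0.0, "s2": 1.0}

pi = {"s1": "b", "s2": "b"}     # a fixed (deterministic) policy to evaluate

# Bellman backup for a fixed policy: V(s) <- R(s) + gamma * sum_s' P(s'|s, pi(s)) V(s')
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])])
         for s in states}

print(V)   # V^pi(s2) -> 1 / (1 - gamma) = 10, V^pi(s1) -> gamma * 10 = 9
```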

Simple MDP Example: Recycling Robot
- The robot can search for a trashcan, wait for someone to bring it a trashcan, or go home and recharge its battery
- It has two energy levels: high and low
- Searching runs down the battery, waiting does not, and a depleted battery has a very low reward

Transition Probabilities (s = s_t, s' = s_{t+1}, a = a_t)

    s      s'     a        P^a_{ss'}   R^a_{ss'}
    high   high   search   α           R^search
    high   low    search   …           …
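One way to encode such a table in code is a dictionary keyed by (state, action). Only the first row is fully given above, so the complementary probability 1 − α for the high→low outcome, the reuse of R^search as its reward, and the numeric placeholder values are assumptions of this sketch.

```python
# Sketch: the recycling-robot transition table as Python data.
# alpha and r_search are placeholder numbers; the 1 - alpha probability and the
# reward of the second outcome are assumptions (only the first row is given).
alpha = 0.7
r_search = 1.0

# transitions[(s, a)] = list of (next_state, probability, expected_reward)
transitions = {
    ("high", "search"): [("high", alpha, r_search),
                         ("low", 1.0 - alpha, r_search)],
    # ... remaining (state, action) pairs are truncated in the source ...
}

def expected_reward(s, a):
    """Expected one-step reward of taking action a in state s."""
    return sum(p * r for _, p, r in transitions[(s, a)])

print(expected_reward("high", "search"))   # equals r_search here
```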
