Random Walk Example
- Each step Z_i equals +1 with probability p and -1 with probability 1 - p.
- At time t_i, the state is X_i = X_0 + Z_1 + ... + Z_i.
- Illustration: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
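Below is a minimal sketch (not from the original slides) of this random walk: the steps Z_i are +/-1, and the step probability p = 0.5 is only an illustrative default.

```python
import random

def simulate_random_walk(n_steps, p=0.5, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with Z_i = +1 (prob p) or -1 (prob 1 - p)."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        z = 1 if random.random() < p else -1  # one step Z_i
        x += z
        path.append(x)
    return path

# Example: a 10-step symmetric random walk starting at X_0 = 0
print(simulate_random_walk(10))
```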
Markov Property
- Also thought of as the "memoryless" property.
- A stochastic process has the Markov property if the probability of state X_{n+1} taking any given value depends only upon state X_n.
- Whether this holds depends very much on how the states are described.

Markov Chains - Markov Chain Definition
A stochastic process {X_t, t = 0, 1, 2, ...} is a finite-state Markov chain if it has:
- a finite number of states,
- the Markov property,
- stationary transition probabilities p_ij, and
- a set of initial probabilities P(X_0 = i) for all states i.
In short, it is a discrete-time stochastic process with the Markov property.

Markov Chains - Simple Example
Weather:
- Raining today: 40% rain tomorrow, 60% no rain tomorrow.
- Not raining today: 20% rain tomorrow, 80% no rain tomorrow.
The same model can be drawn as a stochastic FSM whose arcs carry these probabilities.

Markov Chains - Simple Example
Stochastic matrix: each row sums to 1. For the weather model, the transition matrix is

          rain   no rain
rain      0.4    0.6
no rain   0.2    0.8
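As a quick illustration (assumed code, not from the deck), the weather chain's transition matrix can be written down directly; the snippet checks that each row sums to 1 and propagates the distribution a few days forward starting from "raining today".

```python
import numpy as np

# Transition matrix of the weather chain: state 0 = rain, state 1 = no rain
P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

assert np.allclose(P.sum(axis=1), 1.0)  # stochastic matrix: each row sums to 1

pi0 = np.array([1.0, 0.0])  # it is raining today
for n in range(1, 4):
    print(f"day {n}:", pi0 @ np.linalg.matrix_power(P, n))  # distribution after n days
```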
An industrial example: Google PageRank.

Index
- Markov Chains
- Markov Decision Processes
- Basic Optimization Techniques
- Advanced Introduction to MDP
- Conclusion

Markov Decision Process (MDP)
- A discrete-time stochastic control process; an extension of Markov chains.
- Differences: the addition of actions (choice) and the addition of rewards (motivation).
- If the actions are fixed, an MDP reduces to a Markov chain.

MDP Framework (1/2)
- S: state space
- A: action space
- Pr(s_{t+1} = s' | s_t, a_t) = Pr(s_{t+1} = s' | s_0, ..., s_t, a_0, ..., a_t)  [Markov property]
- R(s): immediate reward at state s
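As a rough sketch of how these ingredients look in code (the state and action names below are invented for illustration and are not part of the slides), a finite MDP can be stored as explicit tables:

```python
# Hypothetical two-state, two-action MDP: transitions P[s][a][s'] and rewards R[s]
S = ["s0", "s1"]
A = ["a0", "a1"]
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s0": 0.2, "s1": 0.8}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}
R = {"s0": 0.0, "s1": 1.0}

# Sanity check: for every (s, a), the next-state probabilities form a distribution
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```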
MDP Framework (2/2)
Find a policy π: S → A that maximizes expected reward. Possible objectives:
- Myopic: E[r_t | π, s_t] for all s.
- Finite horizon: E[Σ_{t=0}^{k} r_t | π, s_0]; the optimal policy may be non-stationary (it depends on time).
- Infinite horizon: E[Σ_{t=0}^{∞} r_t | π, s_0], or the discounted form E[Σ_{t=0}^{∞} γ^t r_t | π, s_0], where 0 < γ < 1 is the discount factor; here the optimal policy is stationary.
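To make the discounted objective concrete, here is a small assumed example (not from the slides) that sums γ^t r_t over a finite reward trajectory:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma**t * r_t for a finite trajectory, with 0 < gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A constant reward of 1 per step approaches 1 / (1 - gamma) = 10 as the horizon grows
print(discounted_return([1.0] * 100, gamma=0.9))
```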
Value of a Policy
- How good is a policy π? How do we measure "accumulated" reward?
- A value function V: S → ℝ associates a value with each state (or with each state and time, for a non-stationary π).
- V^π(s) denotes the value of policy π at state s. It depends on the immediate reward, but also on what is achieved subsequently by following π.
- An optimal policy is one that is no worse than any other policy at any state.
- The goal of MDP planning is to compute an optimal policy (the method depends on how value is defined).

Simple MDP Example: Recycling Robot
- The robot can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery.
- It has two energy levels, high and low.
- Searching runs down the battery, waiting does not, and a depleted battery has a very low reward.

Transition Probabilities
s = s_t   s' = s_{t+1}   a = a_t   P^a_{ss'}   R^a_{ss'}
high      high           search    α           R_search
high      low            search    1 - α       R_search
Since searching from the high state leaves the battery either high or low, the two probabilities for (high, search) must sum to 1, which gives the 1 - α entry.
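To tie the value-function definition to the recycling robot, here is a rough policy-evaluation sketch; the numbers for α, the rewards, and γ are assumptions chosen purely for illustration (the slides do not specify them), and the fixed policy evaluated is an arbitrary choice.

```python
# Assumed parameters for illustration only (not taken from the slides)
ALPHA = 0.8                      # P(battery stays high | high, search)
R_SEARCH, R_WAIT = 2.0, 1.0      # assumed rewards for searching / waiting
GAMMA = 0.9                      # discount factor

# Fixed policy pi: search while the battery is high, wait while it is low
pi = {"high": "search", "low": "wait"}
P = {("high", "search"): {"high": ALPHA, "low": 1 - ALPHA},
     ("low", "wait"):    {"low": 1.0}}
R = {("high", "search"): R_SEARCH, ("low", "wait"): R_WAIT}

# Iterative policy evaluation:
#   V(s) <- R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')
V = {"high": 0.0, "low": 0.0}
for _ in range(200):
    V = {s: R[(s, pi[s])] + GAMMA * sum(p * V[s2] for s2, p in P[(s, pi[s])].items())
         for s in V}
print(V)  # approximate V^pi under the assumed numbers
```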