A POMDP (Partially Observable Markov Decision Process) can be written as the tuple:
$$\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{Z})$$
Compared with an MDP, a POMDP introduces incomplete observation of the environment state. The agent cannot observe the true state directly; it only receives partial information through observations. It therefore has to maintain a belief state, a probability distribution over environment states, in order to make decisions under uncertainty. The observation set $\mathcal{Z}$ contains all observations $z$, and the observation function $\mathcal{O}$ gives the probability of receiving observation $z$:
$$\mathcal{O}(z \mid s',a)=\Pr(z \mid s',a)$$
Solving MDPs
The goal of solving an MDP is to find the state-value function and action-value function corresponding to the optimal policy.
A policy is defined as
$$\pi(a \mid s)$$
which is the probability that the policy selects action $a$ in state $s$. The corresponding state-value function and action-value function are:
$$V^{\pi}(s)=\mathbb{E}_{\pi}[G_t \mid s]$$
$$Q^{\pi}(s,a)=\mathbb{E}_{\pi}[G_t \mid s,a]$$
where $\mathbb{E}$ denotes expectation, $G_t$ is the cumulative discounted reward starting from time step $t$, and $\gamma$ is the discount factor:
$$G_t=\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}$$
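As a quick illustration, here is a minimal Python sketch (assuming a finite reward sequence `rewards` and a discount `gamma`; the names are illustrative, not from the original text) that computes this discounted return using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward G_t for a finite reward sequence r_{t+1}, r_{t+2}, ..."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```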
Different policies lead to different state-value and action-value functions. The optimal state-value and action-value functions, corresponding to the optimal policy $\pi^*$, satisfy:
$$V^*(s)=\max_{a} \left[R(s,a)+\gamma\sum_{s'}\Pr(s' \mid s,a)\,V^*(s')\right]$$
$$Q^*(s,a)=R(s,a)+\gamma\sum_{s'}\Pr(s' \mid s,a)\max_{a'}Q^*(s',a')$$
The Bellman equations provide the recursive relations needed to solve for the optimal policy.
The Bellman Expectation Equation
For a given policy $\pi$, the Bellman expectation equation expresses the value function as an expectation over the trajectories generated by following that policy. Consider the following random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}$$
The cumulative return $G_t$ is defined as:
$$\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\
&= R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \dots) \\
&= R_{t+1} + \gamma G_{t+1}.
\end{aligned}$$
The Bellman equation for the state-value function is:
$$\begin{aligned}
v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right], \quad \forall s \in \mathcal{S}.
\end{aligned}$$
Derivation:
$$\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s].
\end{aligned}$$
Consider the first term, $\mathbb{E}[R_{t+1} \mid S_t = s]$. Applying the law of total probability,
$$P(A)=\sum_{i} P(A \mid B_i)\,P(B_i),$$
we obtain:
$$\begin{aligned}
\mathbb{E}[R_{t+1} \mid S_t = s] &= \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_a \pi(a \mid s) \sum_r p(r \mid s,a)\, r.
\end{aligned}$$
Now consider the second term, $\mathbb{E}[G_{t+1} \mid S_t = s]$. Applying the law of total probability together with the Markov property (given the current state, the conditional distribution of future states depends only on the current state, not on past states):
$$\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s')\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s).
\end{aligned}$$
Combining the two terms:
$$\begin{aligned}
v_\pi(s) &= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s] \\
&= \underbrace{\sum_a \pi(a \mid s) \sum_r p(r \mid s,a)\, r}_{\text{mean of immediate rewards}} + \gamma \underbrace{\sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, v_\pi(s')}_{\text{mean of future rewards}} \\
&= \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S} \\
&= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s,a) \left[ r + \gamma v_\pi(s') \right], \quad \forall s \in \mathcal{S}.
\end{aligned}$$
Policy Iteration
Given a policy $\pi$, the Bellman expectation equation is used to update the state-value function:
$$V^{k+1}(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^k(s') \right]$$
That is, the value function from iteration $k$ is used to compute the value function at iteration $k+1$. This process is repeated until the value function converges to $V^\pi$.
Once $V^\pi(s)$ is available, the policy can be improved using the state-value function, i.e., by choosing the best action in every state:
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \left\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right\}$$
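A minimal tabular policy-iteration sketch in Python, assuming known transition matrices `P[a][s, s']` and a reward table `R[s, a]` (hypothetical names and layout, not from the original text):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: list of |A| transition matrices of shape (|S|, |S|); R: reward table of shape (|S|, |A|)."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)          # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup until V converges to V^pi
        V = np.zeros(n_states)
        while True:
            V_new = np.array([R[s, policy[s]] + gamma * P[policy[s]][s] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to Q^pi(s, a)
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # stop once the policy is stable
            return policy, V
        policy = new_policy
```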
Value Iteration
Value iteration maintains only a state-value function, without maintaining an explicit policy, and updates the value function directly via the Bellman optimality equation:
$$V^{k+1}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^k(s') \right]$$
The value iteration algorithm simply applies this update repeatedly until the value function converges to the optimal state-value function, i.e., until $\left\| V^{k+1} - V^k \right\| \leq \Delta$.
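A corresponding value-iteration sketch, under the same assumed `P` and `R` layout as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, delta=1e-8):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= delta:      # ||V^{k+1} - V^k|| <= Delta
            return V_new, Q.argmax(axis=1)          # optimal values and a greedy policy
        V = V_new
```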
Solving POMDPs
Solving an MDP yields the value function $V^*(s)$ or $Q^*(s,a)$, which can be thought of as two large tables. In a POMDP, however, the agent cannot observe the environment state $s$ directly, so the MDP solution methods cannot be applied as-is; moreover, the optimal decision depends on a statistic summarizing the whole history, not just the current observation. The agent therefore maintains a belief state $b(s)$: the probability distribution over environment states given the history of actions and observations:
$$b_t(s)=\Pr(s_t=s \mid z_{1:t}, a_{1:t-1})$$
There are infinitely many possible beliefs $b(s)$, so they cannot be enumerated in a table. After introducing the belief space, the goal of solving a POMDP becomes computing the optimal value function over beliefs, $V^*(b)$, or the action-value function $Q^*(b,a)$: the maximum expected cumulative reward attainable from belief $b$ under the optimal policy.
Belief-State Update
By the Markov property, the current belief $b_t$ can be computed from the previous belief $b_{t-1}$, the action taken $a_{t-1}$, and the current observation $z_t$:
$$\begin{aligned}
b_t(s') &= \tau(b_{t-1}, a_{t-1}, z_t)(s') \\
&= \frac{1}{\Pr(z_t \mid b_{t-1}, a_{t-1})} \Pr(z_t \mid a_{t-1}, s') \sum_{s \in S} \Pr(s' \mid s, a_{t-1})\, b_{t-1}(s)
\end{aligned}$$
$$\Pr(z_t \mid b_{t-1}, a_{t-1}) = \sum_{s' \in S} \Pr(z_t \mid a_{t-1}, s') \sum_{s \in S} \Pr(s' \mid s, a_{t-1})\, b_{t-1}(s)$$
where $\Pr(z_t \mid a_{t-1}, s')$ is the observation function $\mathcal{O}$ and $\Pr(s' \mid s, a_{t-1})$ is the transition function $\mathcal{T}$.
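A small numpy sketch of this belief update, assuming arrays `T[a][s, s']` for the transition function and `O[a][s', z]` for the observation function (illustrative names, not from the original text):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """tau(b, a, z): discrete Bayes filter over the state space."""
    predicted = T[a].T @ b                   # sum_s Pr(s'|s, a) b(s), one entry per s'
    unnormalized = O[a][:, z] * predicted    # multiply by Pr(z | a, s')
    norm = unnormalized.sum()                # Pr(z | b, a), the normalizing constant
    return unnormalized / norm
```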
Belief MDP
By introducing the belief state, a POMDP can be converted into an equivalent MDP, called a belief MDP. In the belief MDP, the state space is the belief space $\mathcal{B}$ and the action space is still $\mathcal{A}$; the transition function $\tau$ and the reward function $R$ are defined over belief states.
The reward function is defined as:
$$R_B(b,a) = \sum_{s \in S} b(s)\, R(s,a)$$
The state-value and action-value functions of the belief MDP satisfy the Bellman equations:
$$V^*(b) = \max_{a \in A} \left[ R_B(b, a) + \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V^*(\tau(b, a, z)) \right]$$
$$Q^*(b, a) = R_B(b, a) + \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V^*(\tau(b, a, z))$$
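Building on the belief-update sketch above, the one-step quantities of the belief MDP, the belief reward $R_B(b,a)$ and the observation probability $\Pr(z \mid b,a)$, can be computed as follows, again under the assumed `T`, `O`, and `R` layouts:

```python
import numpy as np

def belief_reward(b, a, R):
    """R_B(b, a) = sum_s b(s) R(s, a)."""
    return float(b @ R[:, a])

def observation_prob(b, a, z, T, O):
    """Pr(z | b, a): likelihood of observing z after taking action a from belief b."""
    predicted = T[a].T @ b                   # predictive distribution over next states s'
    return float(O[a][:, z] @ predicted)
```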
POMCP (Partially Observable Monte Carlo Planning)