Policy Optimization

This post organizes and summarizes reinforcement learning algorithms built on the policy optimization theorems, together with their related variants.

Preliminaries

A Markov decision process (MDP) is defined by the tuple \(<\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{T},\gamma,\rho_0>\), where:

  • \(\mathcal{S}\) is the state space, the set of all states
  • \(\mathcal{A}\) is the action space, the set of all actions
  • \(\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow \mathbb{R}\) is the reward function; \(\mathcal{R}(s,a)\) is the reward obtained by taking action \(a\) in state \(s\)
  • \(\mathcal{T}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]\) is the state transition probability function; \(\mathcal{T}(s'|s,a)\) is the probability of reaching state \(s'\) after taking action \(a\) in state \(s\)
  • \(\gamma\) is the discount factor
  • \(\rho_0:\mathcal{S}\rightarrow[0,1]\) is the initial state distribution

The agent's decision making is described by a stochastic policy \(\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]\), where \(\pi(a|s)\) is the probability that the agent selects action \(a\) in state \(s\). Given a policy \(\pi\), the state value function is defined as: \[ \begin{equation} \label{v} V_\pi(s)=\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t\mathcal{R}(s_t,a_t)\,\Big|\,\pi,s_0=s\right] \end{equation} \] and the state-action value function as: \[ \begin{equation} \label{q} Q_\pi(s,a) = \mathcal{R}(s,a)+\gamma\mathbb{E}_{s'\sim\mathcal{T}(\cdot|s,a)}[V_\pi(s')] \end{equation} \]

The advantage function: \[ \begin{equation} \label{adv} A_\pi(s,a) = Q_\pi(s,a)-V_\pi(s) \end{equation} \] The agent's objective is usually to maximize the expected discounted return under the initial state distribution: \[ \begin{equation} \label{eta} \eta(\pi) = \mathbb{E}_{s\sim\rho_0}[V_\pi(s)] \end{equation} \] The discounted state visitation frequency is written as: \[ \begin{equation} \label{rho-pi} \rho_\pi(s)=P(s_0=s)+\gamma P(s_1=s)+\gamma^2 P(s_2=s)+\dots \end{equation} \]
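
As a concrete illustration of Eqs. \(\ref{eta}\) and \(\ref{rho-pi}\), here is a minimal Python sketch (the helper names and the single-trajectory setup are my own assumptions, not from any of the papers below) that computes a discounted return and unnormalized discounted visitation counts:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # sum_t gamma^t * r_t for one trajectory, accumulated backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def discounted_visitation(states, n_states, gamma=0.99):
    # unnormalized rho_pi(s): each visit to s at time t contributes gamma^t
    rho = np.zeros(n_states)
    for t, s in enumerate(states):
        rho[s] += gamma ** t
    return rho

# toy usage: a 3-step trajectory in a 4-state MDP
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))           # 1 + 0.81 = 1.81
print(discounted_visitation([0, 2, 2], n_states=4, gamma=0.9))
```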

The policy gradient theorem: \[ \begin{equation} \label{policy-gradient} \nabla \eta(\pi) = \sum_{s,a}\rho_\pi(s)\nabla \pi(a|s)Q_\pi(s,a) \end{equation} \]
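
To ground the theorem, the following PyTorch sketch (tensor names `logits`, `actions`, `q_values` are assumptions for illustration) builds the standard sample-based surrogate whose gradient matches Eq. \(\ref{policy-gradient}\) via the log-derivative trick \(\nabla\pi(a|s)=\pi(a|s)\nabla\log\pi(a|s)\):

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, actions, q_values):
    # log pi(.|s) for a batch of states with discrete actions
    log_probs = F.log_softmax(logits, dim=-1)
    # pick log pi(a|s) for the actions actually taken
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # negative surrogate: a gradient-descent step ascends eta
    return -(log_pi_a * q_values).mean()

# toy usage: batch of 5 states, 3 discrete actions
logits = torch.randn(5, 3, requires_grad=True)
actions = torch.randint(0, 3, (5,))
q_values = torch.randn(5)
policy_gradient_loss(logits, actions, q_values).backward()
```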

Approximately Optimal Approximate Reinforcement Learning

Kakade, S. & Langford, J. Approximately Optimal Approximate Reinforcement Learning. in Proceedings of the Nineteenth International Conference on Machine Learning 267–274 (Morgan Kaufmann Publishers Inc., 2002).

The paper poses three questions that it aims to answer:

  1. Is there a performance measure that is guaranteed to improve at every update step?
  2. How difficult is it to verify that a given update improves this performance measure?
  3. After a reasonable number of policy updates, what level of performance can the resulting policy achieve?

Consider the following conservative policy update rule: \[ \begin{equation} \pi_{new}(a|s)=(1-\alpha)\pi(a|s)+\alpha\pi'(a|s) \end{equation} \] where \(\alpha\in[0,1]\). To guarantee policy improvement at \(\alpha=1\), \(\pi'\) would have to take an action at least as good as \(\pi\)'s in every state. For \(0<\alpha<1\), improvement only requires \(\pi'\) to choose better actions in most states rather than in all of them. Define the policy advantage \(\mathbb{A}_{\pi,\rho_0}(\pi')\): \[ \mathbb{A}_{\pi,\rho_0}(\pi')=\mathbb{E}_{s\sim\rho_\pi}\left[\mathbb{E}_{a\sim\pi'(\cdot|s)}[A_\pi(s,a)]\right] \] The policy advantage measures the extent to which \(\pi'\) picks actions with larger advantage when states are visited by following \(\pi\) from the initial distribution \(\rho_0\). At \(\alpha=0\) we have \(\frac{\partial \eta}{\partial \alpha}\big|_{\alpha=0}=\frac{1}{1-\gamma}\mathbb{A}_{\pi,\rho_0}(\pi')\).
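
A minimal tabular sketch of the conservative mixture update and an empirical estimate of the policy advantage \(\mathbb{A}_{\pi,\rho_0}(\pi')\) (array shapes and function names are my own assumptions, not from Kakade & Langford):

```python
import numpy as np

def conservative_update(pi, pi_prime, alpha):
    # mixture policy (1 - alpha) * pi + alpha * pi'; both are tables of
    # shape (n_states, n_actions) whose rows sum to 1
    return (1.0 - alpha) * pi + alpha * pi_prime

def policy_advantage(pi_prime, adv, rho):
    # E_{s ~ rho_pi} E_{a ~ pi'(.|s)} [A_pi(s, a)], where rho is an
    # (unnormalized) state weight and adv has shape (n_states, n_actions)
    per_state = (pi_prime * adv).sum(axis=1)
    return float((rho / rho.sum()) @ per_state)
```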

To be continued...

TRPO

Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. Trust region policy optimization. in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 1889–1897 (JMLR.org, 2015).

The policy advantage theorem (performance difference identity): \[ \begin{equation} \label{policy-improve} \eta(\tilde\pi)=\eta(\pi)+\mathbb{E}_{s_0\sim\rho_0,a\sim\tilde\pi(\cdot|s),s'\sim\mathcal{T}(\cdot|s,a)}\left[\sum_{t=0}^{\infty}\gamma^tA_\pi(s_t,a_t)\right] \end{equation} \] Proof (to lighten notation, the subscripts \(s_0\sim\rho_0,s'\sim\mathcal{T}(\cdot|s,a)\) are omitted from the expectations and \(a\sim\pi(\cdot|s)\) is abbreviated as \(a\sim\pi\)): \[ \begin{align} \label{proof-policy-improve} &\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^tA_\pi(s_t,a_t)\right]\\ \notag &=\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left[Q_\pi(s_t,a_t)-V_\pi(s_t)\right]\right]\\ \notag &= \mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left[r_t+\gamma V_\pi(s_{t+1})-V_\pi(s_t)\right]\right]\\ \notag &= \mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]+\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^{t+1} V_\pi(s_{t+1})\right]-\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t V_\pi(s_t)\right]\\ \notag &=\eta(\tilde\pi)-\mathbb{E}_{a\sim\tilde\pi}[V_\pi(s_0)]\\ \notag &=\eta(\tilde\pi)-\eta(\pi) \end{align} \]

Using Eq. \(\ref{rho-pi}\), Eq. \(\ref{policy-improve}\) can also be written in the following form: \[ \begin{equation} \label{policy-improve-2} \eta(\tilde\pi)=\eta(\pi)+\sum_s\rho_{\tilde\pi}(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \end{equation} \] According to Eq. \(\ref{policy-improve-2}\), if \(\sum_a\tilde\pi(a|s)A_\pi(s,a)\geq 0\) for every state \(s\), the new policy is guaranteed to be at least as good as the old one. However, the dependence of Eq. \(\ref{policy-improve-2}\) on \(\rho_{\tilde\pi}\) makes it impossible to evaluate directly during optimization, which motivates the following local approximation to \(\eta\): \[ \begin{equation} \label{surrogate-improve} L_\pi(\tilde\pi) =\eta(\pi)+\sum_s\rho_{\pi}(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \end{equation} \] The only difference between \(L_\pi(\tilde\pi)\) and \(\eta(\tilde\pi)\) is that \(\rho_{\tilde\pi}\) is replaced by \(\rho_{\pi}\). For a parameterized policy \(\pi_\theta\) that is differentiable in \(\theta\), we have: \[ L_{\pi_{\theta_0}}(\pi_{\theta_0})=\eta(\pi_{\theta_0})\\ \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta=\theta_0}=\nabla_\theta\eta(\pi_{\theta})\big|_{\theta=\theta_0} \] Hence, if the update to \(\pi\) is sufficiently small, improving \(L_\pi\) also improves \(\eta\). Consider the following policy update: \[ \tilde\pi(a|s)=(1-\alpha)\pi(a|s)+\alpha\pi'(a|s) \] where \(\pi'=\arg\max_{\pi'}L_\pi(\pi')\). Then (see (Kakade & Langford, 2002) for the proof): \[ \eta(\tilde\pi)\geq L_{\pi}(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma(1-\alpha))(1-\gamma)}\alpha^2 \] where \(\epsilon = \max_s|\mathbb{E}_{a\sim\pi'(\cdot|s)}[A_\pi(s,a)]|\).

\(\alpha\) 足够小的时候则有: \[ \begin{equation} \label{surrogate-bound} \eta(\tilde\pi)\geq L_{\pi}(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2 \end{equation} \]\(\alpha = D_{TV}^\max(\pi,\tilde\pi)\)\(\epsilon = \max_s\max_aA_\pi(s,a)\),则式 \(\ref{surrogate-bound}\) 依然成立,证明见 (Schulman et al., 2015)。考虑到 \(D_{TV}(p||q)^2\leq D_{KL}(p||q)\),则有: \[ \begin{equation} \label{} \eta(\tilde\pi)\geq L_{\pi}(\tilde\pi)-CD_{KL}^\max(\pi,\tilde\pi) \end{equation} \]\(M_i(\pi) = L_{\pi_i}(\pi)-CD_{KL}^\max(\pi,\tilde\pi)\),则有: \[ M_i(\pi_i)=\eta(\pi_i)\\ M_i(\pi_{i+1})\leq \eta(\pi_{i+1})\\ M_i(\pi_{i+1})-M_i(\pi_i)\leq \eta(\pi_{i+1})-\eta(\pi_i) \] 因此,如果最大化 \(M_i(\pi)\),则 \(\eta(\pi)\) 也会随之不断提升,对于参数化的策略 \(\pi_\theta\),策略优化问题便可以转换为下面的优化问题: \[ \max_\theta\left[L_{\theta_{old}}(\theta)-CD_{KL}^\max(\theta_{old},\theta)\right] \] 转换为信赖域约束的形式为: \[ \begin{align} & \max_\theta L_{\theta_{old}}(\theta)\\ &s.t. D_{KL}^\max(\theta_{old},\theta)\leq \delta \end{align} \]

The max-KL constraint above is hard to compute in practice, so the average KL divergence is used as a heuristic approximation of it: \[ \bar D^\rho_{KL}(\theta_{1},\theta_2) = \mathbb{E}_{s\sim\rho}[D_{KL}(\pi_{\theta_1}(\cdot|s)||\pi_{\theta_2}(\cdot|s))] \] With sampled estimates, the problem takes the form: \[ \begin{align} &\max_\theta \sum_s \rho_{\theta_{old}}(s)\sum_a \pi_\theta(a|s)A_{\theta_{old}}(s,a)\\ &s.t.\ \bar D^{\rho_{\theta_{old}}}_{KL}(\theta_{old},\theta)\leq \delta \end{align} \] Using importance sampling: \[ \sum_a \pi_\theta(a|s)A_{\theta_{old}}(s,a) = \mathbb{E}_{a\sim\pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a)\right] \] which yields the TRPO optimization objective: \[ \begin{align} &\max_\theta \mathbb{E}_{s\sim \rho_{\theta_{old}},\ a\sim\pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a)\right]\\ &s.t.\ \mathbb{E}_{s\sim \rho_{\theta_{old}}}[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_\theta(\cdot|s))]\leq\delta \end{align} \]
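
The sketch below is a simplification, not the full TRPO machinery (conjugate-gradient step plus backtracking line search); it only evaluates the two quantities that define the problem above, assuming per-sample log-probabilities of the taken actions under the new and old policies:

```python
import torch

def trpo_surrogate_and_kl(logp_new, logp_old, advantages):
    # ratio pi_theta(a|s) / pi_theta_old(a|s) from log-probabilities
    ratio = torch.exp(logp_new - logp_old)
    # surrogate E_{s,a ~ old}[ratio * A_old(s, a)]
    surrogate = (ratio * advantages).mean()
    # sampled estimate of E_s[KL(pi_old || pi_theta)] using
    # E_{a ~ pi_old}[log pi_old(a|s) - log pi_theta(a|s)]
    kl = (logp_old - logp_new).mean()
    return surrogate, kl
```

TRPO then maximizes `surrogate` subject to `kl <= delta`, rather than applying plain gradient ascent to these quantities.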

PPO

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal Policy Optimization Algorithms. Preprint at http://arxiv.org/abs/1707.06347 (2017).

The TRPO objective is: \[ \begin{align} &\max_\theta \mathbb{E}_{s\sim \rho_{\theta_{old}},\ a\sim\pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a)\right]\\ &s.t.\ \mathbb{E}_{s\sim \rho_{\theta_{old}}}[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_\theta(\cdot|s))]\leq\delta \end{align} \] Turning the constraint into a penalty term converts this into an unconstrained problem: \[ \max_\theta \mathbb{E}_{s\sim \rho_{\theta_{old}},\ a\sim\pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a)-\beta D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_\theta(\cdot|s))\right] \] Let \(r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\), so that \(r_t(\theta_{old})=1\). The approximate TRPO objective is then: \[ L^{CPI}(\theta) =\mathbb{E}_t\left[r_t(\theta)\hat A_t\right] \] Without a constraint, maximizing \(L^{CPI}\) can produce an excessively large policy update in a single step. PPO therefore keeps \(r_t(\theta)\) close to 1 to limit the update size: \[ L^{CLIP}(\theta) =\mathbb{E}_t\left[\min\left(r_t(\theta)\hat A_t,\ \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat A_t\right)\right] \]
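
The clipped objective translates almost directly into code. Below is a minimal PyTorch sketch (tensor names are assumptions; the negation turns ascent into a loss for a gradient-descent optimizer):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min(., .) keeps the pessimistic (lower) estimate of the objective
    return -torch.min(unclipped, clipped).mean()
```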

IMPALA

Espeholt, L. et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. in Proceedings of the 35th International Conference on Machine Learning 1407–1416 (PMLR, 2018).

IMPALA collects experience asynchronously to improve throughput, and uses V-trace to correct for the mismatch between the behavior policy that generates the samples and the target policy being updated.

V-trace Target

\[ v_t=V(s_t)+\sum_{\tau=t}^{t+n-1}\gamma^{\tau-t}(\prod_{i=t}^{\tau-1}c_i)\delta_\tau V \]

where \(\delta_\tau V=\rho_\tau(r_\tau+\gamma V(s_{\tau+1})-V(s_\tau))\), \(\rho_\tau=\min(\bar\rho,\frac{\pi(a_\tau|s_\tau)}{\mu(a_\tau|s_\tau)})\), and \(c_i=\min(\bar c,\frac{\pi(a_i|s_i)}{\mu(a_i|s_i)})\). By convention, \(\prod_{i=t}^{\tau-1}c_i=1\) when \(\tau=t\), and the truncation levels satisfy \(\bar\rho\geq\bar c\). When \(\pi=\mu\) (the on-policy case), \(\rho_\tau=1\) and \(c_i=1\), and the expression above reduces to the on-policy n-step \(TD\) target: \[ \begin{align} \notag v_t &=V(s_t)+\sum_{\tau=t}^{t+n-1}\gamma^{\tau-t}(r_\tau+\gamma V(s_{\tau+1})-V(s_\tau)) \notag\\ &=\sum_{\tau=t}^{t+n-1}\gamma^{\tau-t}r_\tau+\gamma^nV(s_{t+n}) \end{align} \]

V-trace can be computed with the following recursion: \[ v_t=V(s_t)+\delta_tV+\gamma c_t(v_{t+1}-V(s_{t+1})) \] One can also introduce a \(Retrace(\lambda)\)-style discount \(\lambda\in[0,1]\) into the computation of \(c_i\), giving \(c_i=\lambda\min(\bar c,\frac{\pi(a_i|s_i)}{\mu(a_i|s_i)})\). In the on-policy case with \(n=\infty\), V-trace then reduces to \(TD(\lambda)\).
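
A minimal numpy sketch of the backward recursion above (function name and array layout are assumptions; `values` has length T+1 so its last entry serves as the bootstrap \(V(s_{t+n})\)):

```python
import numpy as np

def vtrace_targets(values, rewards, is_ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # values: length T+1 (includes bootstrap); rewards, is_ratios: length T,
    # where is_ratios[t] = pi(a_t|s_t) / mu(a_t|s_t)
    T = len(rewards)
    rho = np.minimum(rho_bar, is_ratios)     # clipped rho_t
    c = np.minimum(c_bar, is_ratios)         # clipped c_t
    vs = np.zeros(T + 1)
    vs[T] = values[T]                        # v_T = V(s_T)
    for t in reversed(range(T)):
        delta = rho[t] * (rewards[t] + gamma * values[t + 1] - values[t])
        # v_t = V(s_t) + delta_t V + gamma * c_t * (v_{t+1} - V(s_{t+1}))
        vs[t] = values[t] + delta + gamma * c[t] * (vs[t + 1] - values[t + 1])
    return vs[:T]
```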

Off-Policy TRPO

Meng, W., Zheng, Q., Shi, Y. & Pan, G. An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 33, 2223–2235 (2022).

The policy advantage theorem is Eq. \(\ref{policy-improve}\); TRPO works with its on-policy approximation Eq. \(\ref{surrogate-improve}\), the difference being that TRPO replaces the state distribution \(\rho_{\tilde\pi}\) with the approximate state distribution \(\rho_\pi\). Off-policy TRPO goes one step further and collects experience with a behavior policy \(\mu\) to update the target policy \(\pi\), leading to a new approximation: \[ \begin{equation} L_{\pi,\mu}(\tilde\pi)=\eta(\pi)+\sum_s\rho_\mu(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \end{equation} \] From this approximation the off-policy optimization objective follows: \[ \begin{align} \notag & \max_\theta \mathbb{E}_{s\sim\rho_\mu,a\sim\mu}\left[\frac{\pi_\theta(a|s)}{\mu(a|s)}A_{\pi_{\theta_{old}}}(s,a)\right]\\ & s.t.\ \bar D_{KL}^{\rho_\mu,sqrt}(\mu,\theta_{old})D_{KL}^{\rho_\mu,sqrt}(\theta_{old},\theta)+D_{KL}^{\rho_\mu}(\theta_{old},\theta)\leq \delta \end{align} \] See (Meng et al., 2022) for the detailed proof.

Off-Policy PPO

Meng, W., Zheng, Q., Pan, G. & Yin, Y. Off-Policy Proximal Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence 37, 9162–9170 (2023).

A clipped approximation of off-policy TRPO, using the following clipped surrogate objective: \[ L^{\text{CLIP}}_{\text{Off-Policy}}=\mathbb{E}_{s\sim\rho_\mu,a\sim\mu}\left[\min\left(r(s,a)A_{\pi_{old}}(s,a),\ \text{clip}(r(s,a),l(s,a),h(s,a))A_{\pi_{old}}(s,a)\right)\right] \] where \(r(s,a)=\frac{\pi(a|s)}{\mu(a|s)}\), \(l(s,a)=\frac{\pi_{old}(a|s)}{\mu(a|s)}(1-\epsilon)\), and \(h(s,a)=\frac{\pi_{old}(a|s)}{\mu(a|s)}(1+\epsilon)\).

See (Meng et al., 2023) for details.
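
A sketch of this clipped surrogate (assuming per-sample log-probabilities under \(\pi\), \(\pi_{old}\), and the behavior policy \(\mu\); not the authors' reference implementation):

```python
import torch

def off_policy_ppo_clip_loss(logp_pi, logp_pi_old, logp_mu, advantages, eps=0.2):
    r = torch.exp(logp_pi - logp_mu)              # r(s,a)  = pi / mu
    r_old = torch.exp(logp_pi_old - logp_mu)      # pi_old / mu
    low, high = r_old * (1.0 - eps), r_old * (1.0 + eps)     # l(s,a), h(s,a)
    clipped = torch.maximum(torch.minimum(r, high), low)     # clip(r, l, h)
    return -torch.min(r * advantages, clipped * advantages).mean()
```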

Behavior PPO

Zhuang, Z., Lei, K., Liu, J., Wang, D. & Guo, Y. Behavior Proximal Policy Optimization. in Proceedings of the eleventh International Conference on Learning Representation (2023).

Online on-policy algorithms can naturally solve the offline reinforcement learning problem.

The overall architecture is BC (behavior cloning) + RL (reinforcement learning); the RL part achieves multi-step updates on the offline dataset through conservative updates controlled by \(\epsilon\).

Following Eq. \(\ref{policy-improve}\), the policy advantage can be written as: \[ J_\Delta (\pi,\hat\pi_\beta)=\mathbb{E}_{s\sim\rho_{\pi},a\sim\pi(\cdot|s)}[A_{\hat\pi_\beta}(s,a)] \] where \(\hat \pi_\beta\) is the behavior policy (which can be learned from the dataset via behavior cloning).

Its approximation on the offline dataset \(\mathcal{D}\) is: \[ \hat J_\Delta (\pi,\hat\pi_\beta)=\mathbb{E}_{s\sim\rho_{\mathcal{D}},a\sim\pi(\cdot|s)}[A_{\hat\pi_\beta}(s,a)] \] The gap between the two satisfies: \[ \begin{align} \notag J_\Delta (\pi,\hat\pi_\beta)\geq &\hat J_\Delta (\pi,\hat\pi_\beta)\\ \notag &-4\gamma\mathbb{A}_{\hat\pi_\beta}\cdot\max_s D_{TV}(\pi||\hat\pi_\beta)[s]\cdot\mathbb{E}_{s\sim\rho_{\hat\pi_\beta}(\cdot)}[D_{TV}(\pi||\hat\pi_\beta)[s]] \\ \label{behavior-theorem-1} &-2\gamma\mathbb{A}_{\hat\pi_\beta}\cdot\max_s D_{TV}(\pi||\hat\pi_\beta)[s]\cdot\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}[1-\hat\pi_\beta(a|s)] \end{align} \] where \(\mathbb{A}_{\hat\pi_\beta}=\max_{s,a}|A_{\hat\pi_\beta}(s,a)|\); see (Zhuang et al., 2023) for the detailed proof. By Eq. \(\ref{behavior-theorem-1}\), to ensure the objective \(J_\Delta (\pi,\hat\pi_\beta)\) does not decrease, one should maximize \(\mathbb{E}_{s\sim\rho_{\mathcal{D}},a\sim\pi(\cdot|s)}[A_{\hat\pi_\beta}(s,a)]\) while minimizing \(\max_s D_{TV}(\pi||\hat\pi_\beta)[s]\), which yields a single update step starting from \(\hat\pi_\beta\).

Now consider multi-step updates on the offline dataset \(\mathcal{D}\), i.e., obtaining \(\pi_{k+1}\) by optimizing from \(\pi_k\). By the policy advantage theorem, the objective is: \[ J_\Delta(\pi,\pi_k)=\mathbb{E}_{s\sim\rho_{\pi},a\sim\pi(\cdot|s)}[A_{\pi_k}(s,a)] \]

Consider the following approximation in which states are sampled from the offline dataset \(\mathcal{D}\): \[ \hat J_\Delta(\pi,\pi_k)=\mathbb{E}_{s\sim\rho_{\mathcal{D}},a\sim\pi(\cdot|s)}[A_{\pi_k}(s,a)] \] The gap between the two satisfies: \[ \begin{align} \notag J_\Delta(\pi,\pi_k)\geq&\hat J_\Delta(\pi,\pi_k)\\ \notag &-4\gamma\mathbb{A}_{\pi_k}\cdot\max_s D_{TV}(\pi||\pi_k)[s]\cdot\mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}[D_{TV}(\pi||\pi_k)[s]] \\ \notag &-4\gamma\mathbb{A}_{\pi_k}\cdot\max_s D_{TV}(\pi||\pi_k)[s]\cdot\mathbb{E}_{s\sim\rho_{\hat\pi_\beta}(\cdot)}[D_{TV}(\pi_k||\hat\pi_\beta)[s]] \\ \label{behavior-theorem-2} &-2\gamma\mathbb{A}_{\pi_k}\cdot\max_s D_{TV}(\pi||\pi_k)[s]\cdot\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}[1-\hat\pi_\beta(a|s)] \end{align} \] where \(\mathbb{A}_{\pi_k}=\max_{s,a}|A_{\pi_k}(s,a)|\); see (Zhuang et al., 2023) for the detailed proof. By Eq. \(\ref{behavior-theorem-2}\), to guarantee that \(J_\Delta(\pi,\pi_k)\) does not decrease, one should maximize \(\mathbb{E}_{s\sim\rho_{\mathcal{D}},a\sim\pi(\cdot|s)}[A_{\pi_k}(s,a)]\) while minimizing \(\mathbb{A}_{\pi_k}\cdot\max_s D_{TV}(\pi||\pi_k)[s]\), which enables multi-step updates on the offline dataset \(\mathcal{D}\).

The above can be written as the constrained optimization problem: \[ \begin{align} \notag &\max_\pi \mathbb{E}_{s\sim\rho_{\mathcal{D}},a\sim\pi(\cdot|s)}[A_{\pi_k}(s,a)]\\ \notag &s.t.\ \max_s D_{TV}(\pi||\pi_k)[s]\leq \epsilon \end{align} \] After a series of derivations, the unconstrained clipped objective becomes: \[ L_k=\mathbb{E}_{s\sim\rho_{\mathcal{D}},a\sim\pi_k(\cdot|s)}\left[\min\left(\frac{\pi(a|s)}{\pi_k(a|s)}A_{\pi_k}(s,a),\text{clip}\left(\frac{\pi(a|s)}{\pi_k(a|s)},1-2\epsilon, 1+2\epsilon\right)A_{\pi_k}(s,a)\right)\right] \]
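
In code this is the same clipping pattern as PPO, with the ratio taken against the previous iterate \(\pi_k\) and the range widened to \(2\epsilon\) (a sketch under those assumptions, with assumed tensor names):

```python
import torch

def bppo_clip_loss(logp_pi, logp_pi_k, advantages, eps=0.1):
    ratio = torch.exp(logp_pi - logp_pi_k)                         # pi / pi_k
    clipped = torch.clamp(ratio, 1.0 - 2.0 * eps, 1.0 + 2.0 * eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```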