Artificial Intelligence [Part 35] PPO Explained: Proximal Policy Optimization in Practice
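Author's note

In the previous article we studied the Actor-Critic architecture, but traditional policy gradient methods suffer from unstable training: a policy update that is too large can cause performance to collapse. PPO (Proximal Policy Optimization) limits the size of each policy update, which keeps training stable while retaining good sample efficiency. It has become one of the most widely used reinforcement learning algorithms, adopted at labs such as OpenAI, DeepMind, and Google. This article walks through the principles behind PPO and implements an agent for a continuous control task.

I. Why We Need PPO

1.1 Problems with Traditional Policy Gradients

Recall the policy gradient used by REINFORCE and A2C:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot A(s,a)\right]$$

Its problems:

| Problem | Description | Consequence |
|---|---|---|
| Step-size sensitivity | The learning rate is hard to choose | Too small: slow convergence; too large: performance collapse |
| Single-use updates | Each sample can be used only once | Low sample efficiency |
| Unstable training | The policy can suddenly get worse | Frequent checkpointing is required |

1.2 TRPO: The Solution and Its Limitations

TRPO (Trust Region Policy Optimization) proposed an elegant solution.

Core idea: limit the difference between the new and old policies so that every update stays inside a trust region.

Constrained optimization problem:

$$\max_\theta \; \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} \cdot A(s,a)\right] \quad \text{subject to} \quad D_{KL}\left(\pi_{\theta_{old}} \,\|\, \pi_\theta\right) \le \delta$$

Advantages of TRPO: a theoretical guarantee of monotonic policy improvement, and very stable training.

Disadvantages of TRPO: it is complex to implement (it needs the Fisher information matrix) and computationally expensive (second-order optimization).

1.3 The Birth of PPO

In 2017 OpenAI proposed PPO. The goal was to keep the stability of TRPO while being as simple as A2C.

PPO's core innovations:

- Clip mechanism: clipping replaces the complex constrained optimization
- Simple implementation: first-order optimization only, easy to implement
- High sample efficiency: the same batch of data can be reused several times

| Property | A2C | TRPO | PPO |
|---|---|---|---|
| Implementation difficulty | Simple | Complex | Fairly simple |
| Training stability | Medium | Very high | Very high |
| Sample efficiency | Low | Medium | High |
| Recommended for | Getting started | Research | Production |

II. The Core Ideas of PPO

2.1 Probability Ratio

Definition:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

Intuition:

- $r_t(\theta) > 1$: the new policy is more likely than the old one to choose action $a_t$
- $r_t(\theta) < 1$: the new policy is less likely than the old one to choose action $a_t$
- $r_t(\theta) = 1$: the new and old policies agree on $a_t$

2.2 Clipped Surrogate Objective

The PPO-Clip solution:

$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) \cdot A_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot A_t\right)\right]$$

The clip function:

$$\text{clip}(x, 1-\epsilon, 1+\epsilon) = \begin{cases} 1-\epsilon & \text{if } x < 1-\epsilon \\ x & \text{if } 1-\epsilon \le x \le 1+\epsilon \\ 1+\epsilon & \text{if } x > 1+\epsilon \end{cases}$$

Why it works:

Case 1: $A_t > 0$ (the action was good, so its probability should increase)
- $r < 1+\epsilon$: optimized normally
- $r > 1+\epsilon$: clipped, which prevents over-optimization

Case 2: $A_t < 0$ (the action was bad, so its probability should decrease)
- $r > 1-\epsilon$: optimized normally
- $r < 1-\epsilon$: clipped, which prevents over-optimization

2.3 The Full Objective

The complete PPO objective, which is maximized (the implementation below minimizes its negative):

$$L^{PPO}(\theta) = \mathbb{E}\left[L^{CLIP}(\theta) - c_1 \cdot L^{VF}(\theta) + c_2 \cdot H(\pi_\theta)\right]$$

where:

- $L^{CLIP}(\theta)$: the clipped policy objective
- $L^{VF}(\theta) = (V_\theta(s) - V^{target})^2$: the value function loss
- $H(\pi_\theta)$: the policy entropy, which encourages exploration
- $c_1, c_2$: coefficients (hyperparameters)

III. A Complete PPO Implementation

3.1 PPO Network Architecture

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.mse_loss in PPOAgent.update
import torch.optim as optim
from torch.distributions import Categorical, Normal


class PPONetwork(nn.Module):
    """PPO network: shared feature extractor with Actor and Critic heads."""

    def __init__(self, state_dim, action_dim, hidden_dim=256, continuous=False):
        super().__init__()
        self.continuous = continuous

        # Shared feature extraction layers
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

        # Actor head
        if continuous:
            self.actor_mean = nn.Linear(hidden_dim, action_dim)
            # State-independent log standard deviation, learned as a parameter
            self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.actor = nn.Linear(hidden_dim, action_dim)

        # Critic head
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        features = self.feature(state)
        if self.continuous:
            mean = self.actor_mean(features)
            std = torch.exp(self.actor_log_std)
            dist = Normal(mean, std)
        else:
            action_probs = torch.softmax(self.actor(features), dim=-1)
            dist = Categorical(action_probs)
        value = self.critic(features)
        return dist, value
```

3.2 Experience Buffer

```python
class PPOBuffer:
    """PPO rollout buffer: stores trajectory data."""

    def __init__(self, state_dim, action_dim, buffer_size, continuous=False):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.buffer_size = buffer_size
        self.continuous = continuous

        # Preallocate memory
        self.states = np.zeros((buffer_size, state_dim), dtype=np.float32)
        self.actions = np.zeros(
            (buffer_size, action_dim) if continuous else (buffer_size,),
            dtype=np.float32 if continuous else np.int64,
        )
        self.rewards = np.zeros(buffer_size, dtype=np.float32)
        self.values = np.zeros(buffer_size, dtype=np.float32)
        self.log_probs = np.zeros(buffer_size, dtype=np.float32)
        self.dones = np.zeros(buffer_size, dtype=np.float32)
        self.ptr = 0

    def store(self, state, action, reward, value, log_prob, done):
        idx = self.ptr % self.buffer_size
        self.states[idx] = state
        self.actions[idx] = action
        self.rewards[idx] = reward
        self.values[idx] = value
        self.log_probs[idx] = log_prob
        self.dones[idx] = done
        self.ptr += 1

    def compute_advantages(self, gamma=0.99, gae_lambda=0.95, last_value=0.0):
        """Compute advantages with GAE.

        last_value is the critic's estimate for the state after the final stored
        step. It bootstraps a rollout that stops in the middle of an episode;
        it is masked out automatically when the final step ended an episode.
        """
        advantages = np.zeros_like(self.rewards)
        last_gae = 0.0
        for t in reversed(range(len(self.rewards))):
            if t == len(self.rewards) - 1:
                next_value = last_value
            else:
                next_value = self.values[t + 1]
            non_terminal = 1.0 - self.dones[t]
            delta = self.rewards[t] + gamma * next_value * non_terminal - self.values[t]
            advantages[t] = last_gae = delta + gamma * gae_lambda * non_terminal * last_gae
        returns = advantages + self.values
        return advantages, returns
```

3.3 PPO Trainer

```python
class PPOAgent:
    """PPO (Proximal Policy Optimization) agent."""

    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, gae_lambda=0.95,
                 clip_epsilon=0.2, value_coef=0.5, entropy_coef=0.01,
                 max_grad_norm=0.5, continuous=False, update_epochs=10, batch_size=64):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.update_epochs = update_epochs
        self.batch_size = batch_size
        self.continuous = continuous

        self.network = PPONetwork(state_dim, action_dim, continuous=continuous)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)

    def select_action(self, state, deterministic=False):
        state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            dist, value = self.network(state_tensor)
            if deterministic:
                action = dist.mean if self.continuous else dist.probs.argmax(dim=-1)
            else:
                action = dist.sample()
            log_prob = dist.log_prob(action)
            if self.continuous:
                # Independent Gaussian dimensions: sum the per-dimension log-probs
                log_prob = log_prob.sum(dim=-1)
        return action.cpu().numpy()[0], log_prob.cpu().numpy()[0], value.cpu().numpy()[0][0]

    def update(self, buffer_data):
        states = buffer_data["states"]
        actions = buffer_data["actions"]
        old_log_probs = buffer_data["log_probs"]
        advantages = buffer_data["advantages"]
        returns = buffer_data["returns"]

        total_loss = total_policy_loss = total_value_loss = total_entropy = 0.0
        n_updates = 0  # counted directly so the averages are exact

        # Several epochs over the same rollout: the key to PPO's data reuse
        for _ in range(self.update_epochs):
            indices = torch.randperm(len(states))
            for start in range(0, len(states), self.batch_size):
                idx = indices[start:start + self.batch_size]
                batch_states = states[idx]
                batch_actions = actions[idx]
                batch_old_log_probs = old_log_probs[idx]
                batch_advantages = advantages[idx]
                batch_returns = returns[idx]

                # Evaluate the current policy
                dist, values = self.network(batch_states)
                log_probs = dist.log_prob(batch_actions)
                entropy = dist.entropy()
                if self.continuous:
                    log_probs = log_probs.sum(dim=-1)
                    entropy = entropy.sum(dim=-1)

                # Probability ratio
                ratio = torch.exp(log_probs - batch_old_log_probs)

                # Clipped policy loss
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_epsilon,
                                    1 + self.clip_epsilon) * batch_advantages
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                value_loss = F.mse_loss(values.squeeze(-1), batch_returns)

                # Entropy bonus (negated because the total loss is minimized)
                entropy_loss = -entropy.mean()

                # Total loss
                loss = (policy_loss + self.value_coef * value_loss
                        + self.entropy_coef * entropy_loss)

                # Backpropagation
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.network.parameters(), self.max_grad_norm)
                self.optimizer.step()

                total_loss += loss.item()
                total_policy_loss += policy_loss.item()
                total_value_loss += value_loss.item()
                total_entropy += entropy.mean().item()
                n_updates += 1

        return {
            "loss": total_loss / n_updates,
            "policy_loss": total_policy_loss / n_updates,
            "value_loss": total_value_loss / n_updates,
            "entropy": total_entropy / n_updates,
        }
```

IV. Hands-On Project: LunarLander Continuous Control

4.1 The LunarLander Environment

Goal: control the lander so that it touches down smoothly on the lunar surface.

State space: 8-dimensional continuous (x and y position, x and y velocity, angle, angular velocity, left-leg contact, right-leg contact).

Action space: 2-dimensional continuous (main engine throttle and lateral engines, each in [-1, 1]; the main engine fires only for positive values).

4.2 Complete Training Code

```python
import gym
import numpy as np
import torch


class LunarLanderTrainer:
    def __init__(self):
        # Needs a gym release whose LunarLander-v2 accepts continuous=True,
        # plus the Box2D dependency.
        self.env = gym.make("LunarLander-v2", continuous=True)
        self.state_dim = self.env.observation_space.shape[0]
        self.action_dim = self.env.action_space.shape[0]

        self.agent = PPOAgent(
            state_dim=self.state_dim,
            action_dim=self.action_dim,
            lr=3e-4,
            gamma=0.99,
            gae_lambda=0.95,
            clip_epsilon=0.2,
            value_coef=0.5,
            entropy_coef=0.01,
            max_grad_norm=0.5,
            continuous=True,
            update_epochs=10,
            batch_size=64,
        )

        self.buffer_size = 2048
        self.buffer = PPOBuffer(self.state_dim, self.action_dim,
                                self.buffer_size, continuous=True)
        self.episode_rewards = []

    def _reset(self):
        # Newer gym versions return (observation, info); older ones return the observation
        state = self.env.reset()
        return state[0] if isinstance(state, tuple) else state

    def train(self, total_timesteps=500000):
        state = self._reset()
        episode_reward = 0
        timestep = 0
        episode = 0

        while timestep < total_timesteps:
            for _ in range(self.buffer_size):
                action, log_prob, value = self.agent.select_action(state)
                # The Gaussian policy is unbounded: clip only what is sent to the
                # environment and store the raw action so log-probs stay consistent
                env_action = np.clip(action, self.env.action_space.low,
                                     self.env.action_space.high)
                result = self.env.step(env_action)
                if len(result) == 5:
                    next_state, reward, terminated, truncated, _ = result
                    done = terminated or truncated
                else:
                    next_state, reward, done, _ = result

                self.buffer.store(state, action, reward, value, log_prob, done)
                state = next_state
                episode_reward += reward
                timestep += 1

                if done:
                    self.episode_rewards.append(episode_reward)
                    episode += 1
                    if episode % 10 == 0:
                        avg_reward = np.mean(self.episode_rewards[-100:])
                        print(f"Episode {episode}, Reward: {episode_reward:.2f}, "
                              f"Avg: {avg_reward:.2f}")
                    state = self._reset()
                    episode_reward = 0

            # Bootstrap from the critic in case the rollout stopped mid-episode
            _, _, last_value = self.agent.select_action(state)
            advantages, returns = self.buffer.compute_advantages(
                self.agent.gamma, self.agent.gae_lambda, last_value)
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

            buffer_data = {
                "states": torch.FloatTensor(self.buffer.states),
                "actions": torch.FloatTensor(self.buffer.actions),
                "log_probs": torch.FloatTensor(self.buffer.log_probs),
                "advantages": torch.FloatTensor(advantages),
                "returns": torch.FloatTensor(returns),
            }
            loss_dict = self.agent.update(buffer_data)
            if episode % 10 == 0:
                print(f"  Loss: {loss_dict['loss']:.4f}, "
                      f"Policy: {loss_dict['policy_loss']:.4f}")
            self.buffer.ptr = 0

            # Solved: average reward of at least 200 over the last 100 episodes
            if len(self.episode_rewards) >= 100:
                avg_reward = np.mean(self.episode_rewards[-100:])
                if avg_reward >= 200:
                    print(f"  Environment solved at episode {episode}!")
                    break

        return self.episode_rewards


# Run training
if __name__ == "__main__":
    trainer = LunarLanderTrainer()
    print("Starting PPO training on LunarLander-v2...")
    rewards = trainer.train(total_timesteps=500000)
```

4.3 Expected Training Results

The log below is illustrative; actual numbers vary with the random seed and library versions.

```
Episode 10, Reward: -150.23, Avg: -180.45
  Loss: 0.0234, Policy: -0.0123
Episode 100, Reward: -50.12, Avg: -89.34
Episode 300, Reward: 120.45, Avg: 85.67
Episode 500, Reward: 230.78, Avg: 210.34
  Environment solved at episode 500!
```

V. Tuning and Optimizing PPO

| Hyperparameter | Role | Recommended value | Tuning advice |
|---|---|---|---|
| lr | Learning rate | 3e-4 | Try values from 1e-4 to 1e-3 |
| γ | Discount factor | 0.99 | 0.995 can help on long-horizon tasks |
| gae_lambda | GAE parameter | 0.95 | Between 0.9 and 0.99 |
| clip_epsilon | Clipping parameter | 0.2 | Between 0.1 and 0.3 |
| update_epochs | Update epochs per rollout | 10 | Between 5 and 20 |

VI. Applications and Outlook

6.1 PPO in Practice

| Application area | Representative work | Notes |
|---|---|---|
| Game AI | OpenAI Five (Dota 2) | Trained with PPO; defeated the world champions |
| Robot control | Boston Dynamics | Learning locomotion control policies |
| Large language models | ChatGPT (RLHF) | PPO optimization from human feedback |
| Autonomous driving | Waymo | Decision and planning systems |

6.2 Learning Path Summary

Part 33: Q-Learning and DQN
↓
Part 34: Actor-Critic (A2C/A3C)
↓
Part 35: PPO (this article)
↓
Next steps: SAC / model-based methods

Next article preview: [Part 36] Introduction to Multi-Agent Reinforcement Learning: Getting Multiple AIs to Cooperate and Compete

We will move on to a more complex setting in which several agents learn and interact at the same time, and explore emergent behavior and cooperative strategies.

This article is Part 35 of the series and covered the principles and practice of the PPO algorithm in detail. Questions are welcome in the comments.

Tags: PPO, Proximal Policy Optimization, deep reinforcement learning, continuous control, LunarLander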
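Supplementary example for Section 2.2: the four clipping cases can be checked by hand on single samples. The sketch below is only an illustration in plain Python, not part of the training code; the helper name `clipped_surrogate` and the sample (ratio, advantage) pairs are made up for this example, with ε = 0.2 as in the article.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical (ratio, advantage) pairs, one per case discussed in Section 2.2
cases = [
    (1.1, 1.0),   # A > 0, r < 1 + eps: optimized normally
    (1.5, 1.0),   # A > 0, r > 1 + eps: clipped at (1 + eps) * A
    (0.9, -1.0),  # A < 0, r > 1 - eps: optimized normally
    (0.5, -1.0),  # A < 0, r < 1 - eps: clipped at (1 - eps) * A
]
values = [clipped_surrogate(r, a) for r, a in cases]
for (r, a), v in zip(cases, values):
    print(f"r={r}, A={a} -> objective={v:.2f}")
```

In the two clipped cases the objective no longer depends on the ratio, so its gradient with respect to the policy parameters is zero and the sample stops pushing the policy further in that direction.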
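Supplementary example for Section 3.2: the GAE recursion is easier to follow on a toy rollout. This is a minimal sketch in plain Python under stated assumptions: the function name `gae`, the three-step rollout, and the `last_value` bootstrap argument (the critic's estimate for the state after the final stored step) are all hypothetical and chosen only for illustration.

```python
def gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (plain Python lists)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        non_terminal = 1.0 - dones[t]  # 0 when step t ended the episode
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        running = delta + gamma * lam * non_terminal * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Toy rollout: three steps with reward 1 each, the episode ends on the last step
adv, ret = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    dones=[0.0, 0.0, 1.0],
    gamma=1.0,
    lam=1.0,
)
print(adv)  # [2.5, 1.5, 0.5]
print(ret)  # [3.0, 2.0, 1.0]
```

With γ = λ = 1 the returns reduce to the plain Monte Carlo returns (3, 2, 1), which makes the recursion easy to verify by hand; lowering λ trades some of that variance for bias from the critic.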