The Impact of Actor and Critic Learning Rate Settings on Training Performance in the DDPG Algorithm

This post was automatically translated from Chinese by an LLM. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

Deep reinforcement learning algorithms have more hyperparameters to tune than ordinary deep learning. This article discusses how the learning rates of the actor and critic networks in the DDPG algorithm affect learning performance.

Recently, while reproducing the DDPG algorithm, I first followed the algorithm flow in the paper and then tested it on the continuous-control environment "Pendulum-v1". However, the training results were not satisfactory and the algorithm converged poorly, as shown in Figure 1.

Figure 1: Training results with actor and critic learning rates both set to 3e-4. Left: critic loss, Right: reward curve

So I looked through many reproductions of this algorithm to compare against and find out where the problem was. After spending a lot of time confirming that the action-selection and update steps of my DDPG implementation were correct, I turned my attention to the hyperparameter settings. It turned out that the learning rate of the critic network usually needs to be set somewhat higher than that of the actor network. After increasing the critic network's learning rate, the DDPG algorithm finally performed well and produced a clean reward curve, as shown in Figure 2.

Figure 2: Learning performance after modifying the critic learning rate. Left: critic loss, Right: reward curve
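
The fix itself boils down to how the two optimizers are configured. Below is a minimal PyTorch sketch, assuming small MLP networks sized for "Pendulum-v1" (3-dimensional observations, 1-dimensional actions bounded in [-2, 2]); the architectures and the 1e-3 critic learning rate are illustrative choices rather than my exact code, but the key point is that the critic's Adam optimizer gets a larger learning rate than the actor's (the original DDPG paper similarly uses 1e-4 for the actor and 1e-3 for the critic).

```python
import torch
import torch.nn as nn

# Illustrative networks for "Pendulum-v1": 3-dim observation,
# 1-dim action bounded in [-2, 2].
class Actor(nn.Module):
    def __init__(self, obs_dim=3, act_dim=1, max_action=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, obs):
        # Tanh bounds the output, then scale to the action range.
        return self.max_action * self.net(obs)

class Critic(nn.Module):
    def __init__(self, obs_dim=3, act_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

actor, critic = Actor(), Critic()

# The key change: give the critic a larger learning rate than the actor.
actor_optim = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)
```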

Solving this problem raised another question: why do the actor and critic in the DDPG algorithm need different learning rates for the algorithm to work well? I looked for online discussions of this issue; the explanations I found can be summarized as follows:

  • The learning rates of the actor and critic are simply two hyperparameters to be tuned, and this particular setting has been found to work well in practice.
  • If the actor updates faster than the critic, the estimated Q values may not accurately reflect the value of the current actions, because the critic's Q function was fitted to past policies (the update sketch after this list shows where this coupling enters).
  • Since the actor outputs concrete actions, which are usually bounded, its learning rate can be smaller; the critic, on the other hand, learns expected discounted returns, which are usually unbounded, so a larger learning rate is needed.

Source: Why different learning rates for actor and critic : r/reinforcementlearning (reddit.com)
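
To make the second explanation concrete, here is a sketch of one standard DDPG update step (written from the algorithm's general form, not copied from my implementation; batch tensors are assumed to have shape [batch, 1] where relevant). The actor's gradient flows through the current critic, so if the critic is updated too slowly, its Q estimates lag behind the policy and the actor ends up climbing a stale value landscape.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (a typical value, not necessarily the one I used)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_optim, critic_optim):
    """One DDPG update on a sampled minibatch (illustrative sketch)."""
    obs, act, rew, next_obs, done = batch

    # Critic update: regress Q(s, a) toward the TD target built from the
    # *target* networks. This is where the critic learning rate takes effect.
    with torch.no_grad():
        next_act = target_actor(next_obs)
        td_target = rew + GAMMA * (1.0 - done) * target_critic(next_obs, next_act)
    critic_loss = F.mse_loss(critic(obs, act), td_target)
    critic_optim.zero_grad()
    critic_loss.backward()
    critic_optim.step()

    # Actor update: follow the deterministic policy gradient, i.e. push
    # actions toward higher Q under the *current* critic. If the critic lags
    # behind the actor, this gradient is computed on stale value estimates.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()

    return critic_loss.item(), actor_loss.item()
```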

To further explore the impact of these two learning rates, I conducted a series of experiments with the DDPG algorithm on "Pendulum-v1"; a sketch of the sweep setup is given after the list.

  1. Fix the learning rate of the actor at 3e-4 and change the learning rate of the critic. The experimental results are shown in the following figure:

  2. Fix the learning rate of the critic at 3e-4 and change the learning rate of the actor. The experimental results are shown in the following figure:

  3. Keep the learning rates of the actor and critic the same, and change both of them. The experimental results are shown in the following figure:
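For completeness, the three sweeps above can be organized roughly as in the sketch below. This is only an outline under assumptions: `train_fn` is a hypothetical stand-in for a complete DDPG training run that accepts the two learning rates, and the candidate values are illustrative rather than the exact grid behind the figures.

```python
def run_lr_sweep(train_fn, base_lr=3e-4,
                 candidate_lrs=(1e-4, 3e-4, 1e-3, 3e-3)):
    """Run the three learning-rate sweeps and collect the reward curves."""
    results = {}

    # Experiment 1: fix the actor learning rate, vary the critic's.
    for critic_lr in candidate_lrs:
        results[("fix_actor", critic_lr)] = train_fn(actor_lr=base_lr,
                                                     critic_lr=critic_lr)

    # Experiment 2: fix the critic learning rate, vary the actor's.
    for actor_lr in candidate_lrs:
        results[("fix_critic", actor_lr)] = train_fn(actor_lr=actor_lr,
                                                     critic_lr=base_lr)

    # Experiment 3: keep both learning rates equal and vary them together.
    for lr in candidate_lrs:
        results[("equal", lr)] = train_fn(actor_lr=lr, critic_lr=lr)

    return results
```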

Based on the above experimental results, the following conclusions can be drawn:

  • When the learning rate of the actor is held fixed, appropriately increasing the learning rate of the critic speeds up convergence. If the critic's learning rate is set too small, convergence becomes slow or fails altogether.
  • When the critic's learning rate is so small that the network struggles to converge, adjusting only the actor's learning rate cannot make the network converge.
  • The actor and critic can also converge when their learning rates are equal, but the value must be chosen carefully to obtain good learning performance.

Therefore, when using the DDPG algorithm, it is recommended to set the critic network's learning rate slightly higher than the actor network's; this tends to give better convergence.