Hung Yu Ling

On RL Stagnation


In Towards Deployable RL, the authors expressed concerns about stagnation in the field of reinforcement learning (RL) research. They identified five research practices causing the stagnation: overfitting to specific benchmarks, wrong focus, detached theory, uneven playing grounds, and lack of experimental rigour. As a solution, they advocate a “deployable RL” research framework, which focuses on solving challenges instead of beating benchmarks. This blog post is a response to that article.

Figure: illustration of progress in RL over time. Innovation happens in bursts; harder problems have longer incubation periods.

Reasons for Stagnation

The authors conflate several distinct problems and attribute them all to “RL stagnation.” The practices they identify are contributing factors, but none of them explains the stagnation on its own.

Indeed, it is harder to start researching RL methods because, unlike supervised learning, RL often requires designing the environment (e.g., brachiation) in addition to the algorithm. The diversity of RL applications creates the impression that RL is moving more slowly than supervised learning, but the reality is that RL covers a broader problem space. Bipedal locomotion, quadrupedal locomotion, and brachiation are each distinct problems, to say nothing of chess/Go, StarCraft/Dota, matrix multiplication, and chip layout.
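To make the point about environment design concrete, here is a minimal sketch of what “designing the environment” entails, written against the Gymnasium API. The `BrachiationEnv` name, observation/action dimensions, and reward terms are illustrative assumptions, not taken from any published benchmark; the point is that every field below is a design decision that has to be made before any RL algorithm can even run.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class BrachiationEnv(gym.Env):
    """Illustrative environment skeleton (hypothetical task and sizes)."""

    def __init__(self):
        # Observation: joint angles/velocities and hand positions (placeholder size).
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(20,), dtype=np.float32
        )
        # Action: normalized torques for each actuated joint (placeholder size).
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(6,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.zeros(20, dtype=np.float32)  # e.g., hanging from the first bar
        return self.state, {}

    def step(self, action):
        # A physics/simulator update would go here; omitted in this sketch.
        # Reward shaping is itself a design problem: forward progress minus energy use.
        reward = 0.0
        terminated = False  # e.g., the agent falls
        truncated = False   # e.g., time limit reached
        return self.state, reward, terminated, truncated, {}


env = BrachiationEnv()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```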

Another problem with the analysis is that the authors conflate RL with its applications, such as robotics. RL should be compared to supervised learning, not to CV or NLP. Nobody calls supervised learning stagnant, even though most practitioners still use Adam or SGD. It is important to keep clear the distinction between RL as a methodology and the applications built on top of it.

Has RL research really stagnated?

RL, and more specifically its applications, feel stagnant because people subconsciously hold them to a higher standard.

In supervised learning, reaching human-level performance is effectively the ceiling. Objectively, ChatGPT and other LLMs write at the level of an average human, solve grade-school math problems, and have a basic understanding of a wide range of topics. Similarly, the drawing ability of Stable Diffusion and the like is above that of the average human, yet still only comparable to professional artists.

With RL, AlphaGo Zero (2017) and AlphaTensor (2022, fast matrix multiplication) have expanded our knowledge of the world. In 2021, RL was used to design chip layouts at a level comparable or superior to human experts. In each case, RL discovered things that no human had previously known, however niche the application. Discoveries don’t happen every day, so it is normal to feel, from time to time, that RL and its related applications are stagnant.

The Real Problems (What does “RL Stagnation” mean?)

What the authors, and many others, meant by “RL stagnation” is a combination of:

  1. Improvements are not fundamental. Compared to supervised learning, RL algorithms converge more slowly and often to local optima. RL algorithms also cannot be applied generally: different algorithms perform better or worse on different problems. While this is a valid criticism, it is not a reason to stop fundamental (“RL first”) research on RL algorithms.

  2. Popular benchmarks do not include “real” problems. A related issue, which the authors implied, is the lack of investment from large corporations and startups. The question should be why more companies are not putting out prizes for the challenges they want solved. In robotics, giving away robots for free is clearly not economically viable. In medicine and other fields, holding onto training data or simulators is presumably more valuable than having challenges solved sooner.

  3. RL applications have not provided tangible value. RL has been successfully applied to games, mathematics, engineering, and more. It is easy to dismiss this progress because these applications are not as visible as CV and NLP applications.

The Solution

In conclusion, I believe RL has made and is making steady progress. It is worth recognizing that RL is hard and that RL applications are niche. The lack of media coverage around successful RL applications, at least compared to Stable Diffusion or ChatGPT, creates the impression that RL is stagnant. But that is not a reason to stop fundamental research on RL algorithms, nor to focus only on applications that create tangible value. Future algorithm research should focus on time efficiency rather than sample efficiency; faster iteration is the key to enabling broader adoption.
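To be clear about the distinction, here is a minimal sketch of how the same training run can be reported against two different x-axes: environment steps (sample efficiency) versus elapsed wall-clock time (time efficiency). The `agent.act`/`agent.update` interface is an assumption for illustration; the environment follows the Gymnasium step/reset convention.

```python
import time

def train(agent, env, wall_clock_budget_s=3600):
    """Hypothetical training loop that records both metrics:
    sample efficiency indexes performance by environment steps,
    time efficiency indexes it by elapsed wall-clock time
    (which also includes the cost of learner updates)."""
    history = []  # (env_steps, elapsed_seconds, episode_return)
    env_steps, start = 0, time.perf_counter()
    while time.perf_counter() - start < wall_clock_budget_s:
        obs, _ = env.reset()
        episode_return, done = 0.0, False
        while not done:
            action = agent.act(obs)                       # assumed agent API
            obs, reward, terminated, truncated, _ = env.step(action)
            agent.update(obs, reward)                     # assumed agent API
            episode_return += reward
            env_steps += 1
            done = terminated or truncated
        history.append((env_steps, time.perf_counter() - start, episode_return))
    return history
```

Plotting `episode_return` against the first column gives the usual sample-efficiency curve; plotting it against the second column is what matters for iteration speed and, ultimately, adoption.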