For decades, reinforcement learning has been one of the most powerful and most interesting machine learning techniques available. There’s something deeply intriguing about observing an AI model become more intelligent over time, on its own, without anyone’s interference. It’s a bit like watching a plant sprout, or a child master a new skill. Growth is beautiful and magical.

Most people, including developers, perceive RL to be beyond their grasp. Accepting it as a technique used by teams of sophisticated scientists at large companies, they resign themselves to watching YouTube videos on the subject and occasionally discussing the current and potential future applications of RL in a world shaped by others. Gated behind diplomas and intimidating job titles, experiential learning is kept sterile in its ivory tower.

There was a time when this state of affairs was unavoidable. Until the launch of ChatGPT in November 2022, AI took the form of relatively niche models trained for particular tasks using painstakingly curated data. Even after gpt-3.5-turbo and its successors were widely adopted and their generalizability was recognized, developers still lacked access to the model weights themselves. OpenAI eventually exposed a basic RFT endpoint, but it was too brittle and too narrowly scoped for most non-academic tasks. OpenAI’s distrust of the open community translated into narrowly gated APIs designed to limit misuse, not empower the developer.

The first useful open-weight model, Llama 2, was released in July 2023. The fledgling open-source community quickly coalesced around its architecture, and teams organized to solve pressing problems. Axolotl and Unsloth built frameworks to simplify the process of adjusting model weights through supervised fine-tuning (SFT) and direct preference optimization (DPO). vLLM and llama.cpp (among many others) provided inference engines that let developers run open models on their own hardware. Finally, the average person could customize an AI model and fully control its weights.

Something was missing. Despite the open-source community’s vibrant ecosystem of tools for shaping models to developer needs, performance lagged behind the closed labs. Specialized open models won on cost and speed, but their quality was usually inferior to that of the most expensive closed alternatives, even after significant effort had gone into training. Preparing training datasets to close the gap was difficult and fraught with error.

Most believed that a performance gap was inevitable. Organizations could not afford to invest the billions of dollars in data and training infrastructure necessary to train a frontier model and then open-source the result. Shareholders wouldn’t stand for it, and billionaire individuals lacked the appetite. Alibaba and DeepSeek took up Meta’s dropped banner, releasing Qwen2.5 and R1, but limited compute and concerns over the safety of their models curbed developer expectations. It seemed certain that the best AI applications would be built either by deeply specialized teams or on top of centralized closed models.

But some individuals saw the world differently. They realized that DeepSeek’s GRPO paper implied a model could improve not only at verifiable tasks (like math and coding) but also at unverifiable ones, so long as a judge model could serve as the reward function. By letting an LLM run through a variety of real-world scenarios many times in parallel and assigning higher rewards for better performance, it might be possible to teach an open model to perform its task more reliably than any closed model.
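The core of that idea is small enough to sketch in code. In the snippet below, `generate_rollout` and `judge_score` are hypothetical placeholders rather than any particular library’s API; what matters is that the reward comes from an LLM judge instead of a hard-coded verifier, and that each rollout is scored relative to the other rollouts in its group, as GRPO prescribes.

```python
import statistics
from typing import Callable

def group_relative_advantages(
    scenario: str,
    generate_rollout: Callable[[str], str],    # hypothetical: sample one completion from the policy
    judge_score: Callable[[str, str], float],  # hypothetical: LLM judge returns a reward in [0, 1]
    group_size: int = 8,
) -> list[tuple[str, float]]:
    """Sample a group of rollouts for one scenario and compute GRPO-style advantages.

    Each rollout's advantage is its judge-assigned reward normalized against the
    mean and standard deviation of the rewards within its own group.
    """
    rollouts = [generate_rollout(scenario) for _ in range(group_size)]
    rewards = [judge_score(scenario, rollout) for rollout in rollouts]

    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero when all rewards tie

    return [(rollout, (reward - mean) / std) for rollout, reward in zip(rollouts, rewards)]
```

Rollouts that beat their group’s average get positive advantages and are reinforced; rollouts that fall below it are discouraged. That relative comparison is what lets a fuzzy, unverifiable task stand in for an exact verifier.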

In order to understand why this is important, we must first look at the two failure modes that LLMs are prone to. First, a model might lack the “world knowledge” necessary to perform a task. If you ask a small 1B model which Coldplay song was released last in 2016, it probably won’t know. If you ask a 235B model the same question, it probably will. The sheer difference in size between a model that takes up 2GB of storage and one that requires 470GB lets the larger model hold far more facts about the world in its weights. Short of a pretraining run, there is very little that can teach a small model new facts about the world. Fortunately, this failure mode is the less common of the two.

Generally, issues in LLM performance stem from a lack of reliability. A task may be well within a model’s capabilities most of the time, yet the model still fails occasionally. LLMs are nondeterministic and famously unpredictable. For example, given the task of creating a meal plan that corresponds to a user’s stated needs, o3 might correctly generate a menu 95% of the time, Sonnet 4 might succeed 90% of the time, and Qwen2.5 14B might have a success rate of 80%. Each model can do the task, but some mess up more often than others.
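Reliability in this sense is just an empirical success rate, and measuring it is the natural first step before trying to improve it. A minimal sketch, where `run_meal_planner` and `meets_user_needs` are hypothetical stand-ins for your own application and its correctness check:

```python
def estimated_success_rate(
    run_meal_planner,   # hypothetical: calls the model and returns a meal plan
    meets_user_needs,   # hypothetical: returns True if the plan satisfies the user's constraints
    user_request: str,
    trials: int = 100,
) -> float:
    """Estimate how often a nondeterministic model completes a task correctly."""
    successes = sum(
        1 for _ in range(trials) if meets_user_needs(run_meal_planner(user_request))
    )
    return successes / trials
```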

Through experimentation on real-world tasks, it’s become clear that this reliability problem can be solved by RL. Small models trained with GRPO and a generalized reward model frequently exceed the capabilities of o3 and Sonnet 4 at specific tasks. For the open-source community and for developers at large, this means complete AI independence, forever. Your inference costs will decrease, your latency will decrease, and you will never lose access to the AI you rely on.
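What a “generalized reward model” can look like in practice: rather than writing a task-specific scoring function, you prompt a strong general-purpose model to grade each trajectory against the task description. The sketch below uses the OpenAI Python client; the grading prompt and the choice of gpt-4o as judge are illustrative assumptions, not a prescribed recipe.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_reward(task_description: str, trajectory: str) -> float:
    """Ask a general-purpose judge model to grade a trajectory from 0.0 to 1.0."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are grading how well an AI agent completed a task. "
                    'Respond with JSON: {"score": <float between 0 and 1>}.'
                ),
            },
            {
                "role": "user",
                "content": f"Task:\n{task_description}\n\nAgent trajectory:\n{trajectory}",
            },
        ],
    )
    return float(json.loads(response.choices[0].message.content)["score"])
```

A reward like this plugs straight into the group-relative scheme sketched earlier, which is what lets one training recipe cover tasks that have no programmatic verifier.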

We’ve written and open-sourced ART to simplify the process of helping open models learn from experience and exceed SOTA performance. The ART client slides into any AI application built on top of an LLM and makes RL accessible to everyone. A number of companies have already used ART to replicate our success in their own closed environments.
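The overall loop that a library like ART automates looks roughly like the sketch below. Every function named here (`sample_scenarios`, `run_agent`, `train_step`) is a hypothetical placeholder standing in for your application’s data, your existing agent code, and the weight update; it is not ART’s actual interface.

```python
def training_loop(
    sample_scenarios,  # hypothetical: yields batches of real-world scenarios
    run_agent,         # hypothetical: your existing LLM application, unchanged
    judge_reward,      # generalized LLM-judge reward, as sketched above
    train_step,        # hypothetical: updates the open model's weights (e.g. via GRPO)
    epochs: int = 10,
    group_size: int = 8,
):
    """Schematic RL-from-experience loop: collect grouped rollouts, score, reinforce."""
    for _ in range(epochs):
        for scenario in sample_scenarios():
            # Run the same scenario several times to form a comparison group...
            trajectories = [run_agent(scenario) for _ in range(group_size)]
            # ...score every attempt with the generalized reward model...
            rewards = [judge_reward(scenario, t) for t in trajectories]
            # ...and reinforce whichever attempts outperformed their peers.
            train_step(scenario, trajectories, rewards)
```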

Rather than a world of centralization and domination by a few large companies, we are entering an era of art and freedom. Now is the time to build our future.