JoyCaption, the open source image captioning model that I work on as a hobby, recently had its Beta One release, which was also the first release that used Reinforcement Learning (RL) to improve the model. RL has been a bit of a hot topic ever since the introduction of DeepSeek R1, and for good reason. But I believe people have a fundamental misunderstanding of RL and what its uses are, due in large part to OpenAI. In this article I want to shine a light on the mysteries of RL, as well as dive into specific details of how JoyCaption was run through the RL gauntlet. I love sharing my tools and knowledge with the community, to help others build cool stuff, which is why I’ve previously shared details on my models. So, with all that said, the first half will be dedicated to running through what Reinforcement Learning is and how OpenAI misled people. If you don’t care about that, or are already versed, and want to jump straight into the juicy details of training a model, skip ahead to “How RL was used in JoyCaption: Beta One”.

What Is Reinforcement Learning

The concepts around RL are often explained in obtuse, difficult-to-understand terms, when in reality, in the context of Large Language Models, RL is a very simple extension to the “normal” way of post-training LLMs. The fancy term for the normal way these models are trained is Supervised Finetuning (SFT), a term that has gained popularity to help differentiate it from newer training methods. If you know how models were trained in the past, then you know what SFT is, which means you already know something about RL! This is because SFT is a subset of RL. With that in mind, the easiest path to learning RL is to start from SFT and add one little extension at a time until we get to full-blown RL.

With traditional SFT you have a dataset of examples where each example is a (prompt, response) pair. During training these examples are run through the model and the probabilities of the desired responses are driven up using some form of gradient descent (e.g. the AdamW optimizer). In other words, we have a bunch of positive examples and drive the model to replicate them.
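
As a concrete (if simplified) illustration, here is roughly what a single SFT step looks like in PyTorch with a Hugging Face-style causal LM. The model name and the (prompt, response) pair are placeholders, and real training code would batch examples and handle padding; this is just a sketch of the idea.

```python
# Minimal SFT step sketch (illustrative model and data, not JoyCaption's actual pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt, response = "Describe this image.", " A cat sleeping on a windowsill."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, response_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt; only the response contributes to the loss

# Cross-entropy on the response tokens drives their probabilities up.
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```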

The first thing to note about traditional SFT is that the responses in the examples are typically human written, either gathered from existing sources on the internet (e.g. StackOverflow, Quora) or written manually. But it is also possible to build the dataset using responses from the model we’re about to train: generate a bunch of candidate responses and pick the best, thus forming your dataset. This is called Rejection Sampling. The idea is that the desired response is already latent in the model from pre-training; we’re just trying to separate it from all the undesired responses.
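
Here is a rough sketch of what offline Rejection Sampling could look like. The `judge_score` function is a hypothetical stand-in for whatever does the rating (a human label, a reward model, etc.), and the generation settings are purely illustrative.

```python
# Rejection-sampling sketch: generate candidates from the current model and keep the best one.
def build_rejection_sampling_dataset(model, tokenizer, prompts, judge_score, num_candidates=8):
    dataset = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,                      # sampling gives diverse candidates
            num_return_sequences=num_candidates,
            max_new_tokens=256,
        )
        # Strip the prompt tokens, keep only the generated continuations.
        candidates = tokenizer.batch_decode(
            outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        best = max(candidates, key=lambda response: judge_score(prompt, response))
        dataset.append((prompt, best))           # (prompt, best response) becomes an SFT example
    return dataset
```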

Second, note that the dataset is built before training; this is known as “Offline” training. But if you’re doing Rejection Sampling you could, in theory, build the dataset on the fly as the model is being trained, which is known as “Online” training. Online training is neat, since as the model improves during training so do its responses, forming a beneficial feedback loop.
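
The Online variant is conceptually the same loop, just folded into training. A hand-wavy sketch, reusing the hypothetical pieces from the earlier snippets (`sft_step` here is an assumed helper that wraps the SFT loss shown above):

```python
# Online sketch: sample and rate fresh responses from the current model at each step,
# rather than building the whole dataset up front.
for prompt in training_prompts:  # training_prompts is assumed to exist
    _, best_response = build_rejection_sampling_dataset(
        model, tokenizer, [prompt], judge_score
    )[0]
    sft_step(model, tokenizer, optimizer, prompt, best_response)
```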

Third, note that the dataset is built entirely from Positive examples: examples of responses we want the model to write. If you’re doing Rejection Sampling, what about all the rejected responses? It would be useful to teach the model what not to generate by selecting Negative examples. Is that possible? Of course! Instead of driving the Negative examples towards 100% probability like we do for Positive examples, we drive them towards 0%.
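
One simple way to “drive towards 0%” is an unlikelihood-style penalty on the rejected response’s tokens. This is a sketch of the idea, not the exact loss any particular recipe uses:

```python
import torch
import torch.nn.functional as F

def negative_example_loss(model, input_ids, response_start):
    """Penalize the tokens of a rejected response: loss shrinks as p(token) -> 0."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]    # position t predicts token t+1
    targets = input_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_p = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    token_log_p = token_log_p[:, response_start - 1:]        # only the response tokens, not the prompt
    p = token_log_p.exp().clamp(max=1 - 1e-6)
    return -torch.log1p(-p).mean()                           # -log(1 - p), large when the model likes the bad token
```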

Finally, imagine that we want to SFT a model, but not too much; we want to be sure the trained model always stays mostly similar to the pretrained model. Well, that’s easy: just also calculate the probabilities the original model would have predicted, and add a loss that drives our trained model towards those original probabilities. With some loss scaling we can balance the two forces and get a training loop that adjusts the model based on our targets while also keeping it close to the original model. This second loss is called the KL Divergence loss.
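
In code, that second loss might look something like this: keep a frozen copy of the original model and penalize divergence between the two sets of next-token probabilities. It assumes the `model` from the earlier SFT sketch, and `kl_weight` is an illustrative knob, not a recommended value.

```python
# KL-penalty sketch: keep the trained model close to a frozen copy of the original weights.
import copy
import torch
import torch.nn.functional as F

ref_model = copy.deepcopy(model).eval()          # frozen snapshot of the pretrained model
for p in ref_model.parameters():
    p.requires_grad_(False)

def kl_penalty(model, ref_model, input_ids, kl_weight=0.1):
    logits = model(input_ids=input_ids).logits
    with torch.no_grad():                        # the reference is never updated
        ref_logits = ref_model(input_ids=input_ids).logits
    # KL(trained || original), summed over the sequence and averaged over the batch.
    kl = F.kl_div(
        F.log_softmax(ref_logits, dim=-1),       # reference distribution (log-space)
        F.log_softmax(logits, dim=-1),           # trained distribution (log-space)
        log_target=True,
        reduction="batchmean",
    )
    return kl_weight * kl
```

During training you would simply add this penalty to the SFT (or negative-example) loss before calling backward(); that is the “balance the two forces” part.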

Believe it or not, you now understand RL for LLMs. In fact, the only thing unique to RL is the use of Negative examples. Everything else can be used and still be considered SFT. You can do Online SFT with a KL Divergence loss, for example. Once you introduce Negative examples, you’re doing RL.

Putting it all together, you can do RL by just adding Negative examples and KL Divergence, leaving Online training in your pocket to use when appropriate. There really is no more to it than that. There are more exotic forms of RL, but DPO and GRPO are the most popular recipes, and this is what they boil down to. DPO works with pairs of responses for each prompt, one positive and one negative. GRPO works with groups of responses per prompt, assigning a scalar amount of positive or negative to each response. All that scalar does is scale the loss for that particular prompt-response pair, so the model learns more from very good and very bad responses than it does from “meh” responses. GRPO also automatically adjusts these scalars throughout training (by normalizing them across the responses in a group), which helps with the training stability issues that were endemic to earlier forms of RL.
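
For concreteness, here is roughly what the core math of the two recipes looks like. The DPO function follows the published loss; the GRPO snippet only shows the advantage-normalization step, not the full clipped objective. Variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers each response than the frozen reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen/rejected margin apart; the KL constraint is implicit in beta.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def grpo_advantages(rewards):
    # rewards: judge scores for one group of responses to the same prompt.
    # Normalizing turns them into per-response scaling factors for the loss.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```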

As for Online vs Offline, Offline is often easier to implement. With Offline you can build the dataset at your leisure, which is especially important when hiring humans to do evaluation and rating of responses. You can spend as much time as you’d like generating your dataset, tweaking it, etc. Online requires all of that work to happen on the fly; you can imagine how complicated and slow it would be to generate responses and have humans rate them for every batch during training.

So why bother with Online training? Models gain more performance when they train on their own outputs. With Offline, even if all the responses in your dataset were generated by the original model, the model will slowly diverge during training and no longer be exactly the same. With KL loss and small training runs the difference in performance is negligible, which is why Offline still works in those scenarios. For longer training runs the divergence quickly becomes an issue. With Llama 3, Meta’s approach was Iterative: they did three smaller rounds of Offline DPO, generating a fresh dataset before each one. That strikes a good balance between the two approaches. It also affords the opportunity to inspect the model’s progress between rounds and adjust strategy if needed.

Before I end this section, let me give you two more terms that you’ll encounter when reading about RL: “Policy” is what the model being trained is called; “Judge” is what the thing rating the responses is called (which could be humans, or even another AI model). Finally, let me draw your attention to one important aspect of all this: having the model train on its own output. While not necessary in a strict sense, it is all but required for RL to function. I cannot stress enough how important this is. The exact reasons why will be presented shortly.

How OpenAI Misled Everyone

For a long time in the LLM world RL meant RLHF (Reinforcement Learning from Human Feedback), and there was a widespread belief that RLHF is only useful for improving the safety of models or preference tuning them. That belief has thankfully been evaporating since the introduction of DeepSeek R1; however, I argue that RL is still deeply misunderstood. RL has been the backbone of every successful LLM since the beginning, not just the more recent reasoning models! More crucially, I want to stress that RL is the foundation of building useful LLMs. In many ways, a functioning LLM cannot be built without RL.