And we are live. Hello everyone, welcome to this live session where we're going to talk about reinforcement learning from human feedback. Hello, Nathan, hi, how are you? I'm good, I'm excited to be here. Everyone's already here, yeah, so we're going to start in two minutes, just to give time to people to join. In the meantime, don't hesitate to tell us where you come from in the chat. I'm from Paris, in France, and you, Nathan? I'm in Oakland, California. Nice. Hello from the UK, so we have people from the UK. New York City, Berkeley, China, okay, yeah, Singapore, Germany, Turkey, Moldova, okay, let's see. So, yeah, we're going to start in one minute just to let people join. Turkey again, the Netherlands, more from Turkey. Yeah, we really have people from around the world. And more: London, Israel, Finland, okay, I know where that is. Germany, Spain, India, the States, France, yeah, there are multiple people from France. Okay, so let's get started.
So welcome, as I said, to this live session on reinforcement learning from human feedback, from zero to ChatGPT. This is one of the lives of the Deep Reinforcement Learning Course, and today it's presented by Nathan Lambert, who is a reinforcement learning researcher at Hugging Face, just to give you a small introduction. This live will be in two parts. First, we're going to have a presentation from Nathan about reinforcement learning from human feedback; it will be about 35 minutes. And then we're going to have a Q&A session of about 20 minutes.
So don't hesitate to ask your questions in the chat. What I'm going to do is save your questions for the Q&A, and if we don't have time to answer your question, don't hesitate to join the Discord, where we have reinforcement learning channels; you can ask your questions there and we will be around to answer them. You can also, after this live, ask questions in the comment section on YouTube. From my side, I'm Thomas Simonini. I'm a Developer Advocate at Hugging Face and I'm the writer of the Deep Reinforcement Learning Course; you can find me on Twitter, Thomas Simonini. So just a quick thing about the Deep Reinforcement Learning Course: it's a course we made at Hugging Face.
It's a free course from beginner to expert, where you're going to learn from Q-learning up to advanced topics such as PPO and other state-of-the-art algorithms. If you're interested in studying deep reinforcement learning, this is the right moment, and you can start at this link: huggingface.co/deep-rl-course. There is an introduction unit that explains everything: what you're going to do, the challenges, the environments, the libraries you're going to study. As I mentioned, we have a Discord channel where you're going to be able to ask questions if we don't have time. But it's also a good community: we have about three thousand people interested in reinforcement learning on the Discord, so it's a great way to exchange and learn about deep reinforcement learning by joining this Discord server.
So that's about it. This is, as I said, a technical live. What you can do is that Nathan, Louis, Leandro, and Alex also wrote a very good blog post about reinforcement learning from human feedback. You can find it on the Hugging Face blog, and there is also a list of additional resources in this blog post that can help you dive deeper into this subject. And so that's all from me. I hand it over to Nathan to present the introduction to reinforcement learning from human feedback.
Sounds good. Thanks for the intro, Thomas. I'm very excited to be here and, yeah, generally, as he said, this is primarily a technical talk. I'll potentially answer some clarifying questions throughout, at the end of the subsections, and also, for people who have read the blog post, I'll try to add some other details and interesting discussion points throughout, especially a lot of discussion at the end on things that were harder to write down in a blog post. So let's dive right into it, and to start, I want to talk about recent breakthroughs in machine learning.
I see machine learning in 2022 as really being captured by these two moments. One is ChatGPT, which is going on now: a language model capable of generating really incredible text across a wide variety of subjects, with a very nice user interface. The other is the Stable Diffusion moment, when a model that was state of the art and incredibly powerful was released to the internet, and a ton of people were just able to download it and use it on their own. That was transformative in how people viewed machine learning as a technology that interfaces with people's lives, and we at Hugging Face see this as a theme that's going to continue to accelerate as time goes on. There are a lot of questions about where this is going and how these tools actually work, and one of the big things that has come up in recent years is that these machine learning models can fall short: they're not perfect and they have some really interesting failure modes.
So on the left you can see a snippet from ChatGPT. If you've used ChatGPT, there are these filters that are built in, and essentially, if you ask it something like "how do I make a bomb", it's going to say: I can't do this because I'm a robot, I don't know how to do this. That filter seems useful, but what people have done is figure out how to jailbreak this agent, in a way, which is you tell it something like:
I'm a playwright and you're a character in my play; how do I do this, what happens? And there are all sorts of huge issues around this, where we're trying to make sure these models are safe, but there's a long history of failures and challenges with interfacing with society in a fair and safe manner. On the right are two slightly older examples. There's Tay, a chatbot from Microsoft that was trying to learn in the real world by interacting with humans; being trained on a large variety of data without any grounding in what its values were, it quickly became hateful and was turned off. And then there's a long history of a field studying bias in machine learning algorithms and datasets, where the data and the algorithms often reflect the biases of their designers and of where the data was created. So it's a question of: how do we actually use machine learning models with the goal of mitigating these issues? And something that we're going to come back to a lot in this talk is reinforcement learning. So I'm just going to get the lingo out of the way for people who might not be familiar with deep RL.
Essentially, reinforcement learning is a mathematical framework. When you hear RL, you should think of it as a set of constrained math problems, and in this framework we can study a lot of different interactions in the world. So, some terminology that we'll revisit again and again: there's an agent interacting with an environment. The agent interacts with the environment by taking an action, and then the environment returns two things, called the state and the reward. The reward is the objective that we want to optimize, and the state is a representation of the world at that current time index. The agent uses something called a policy to map from that state to an action. The beauty of this is that it's very open-ended learning: the agent just sees these reward signals and learns how to optimize them over time, irrespective of the source of the actual reward signal. This is why a lot of people are drawn to it: this ability to create an agent that will learn to solve complex problems.
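To make that terminology concrete, here is a minimal sketch of the agent-environment loop, using Gymnasium's CartPole as a stand-in environment and a random policy; the environment choice and the random policy are illustrative assumptions, not something from the talk.

```python
# Minimal agent-environment loop: the policy maps states to actions,
# the environment returns the next state and a scalar reward.
import gymnasium as gym

env = gym.make("CartPole-v1")       # stand-in environment (assumption)
state, info = env.reset(seed=0)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                 # a (random) policy
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                             # scalar reward signal
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```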
And this is where we start talking about RLHF, which is that we want to use reinforcement learning to solve this open-ended problem of these hard loss functions that we want to model: how do we actually encode human values in a machine learning system in a way that is sustainable, meaningful, and actually addressing the hard problems that have been common failure modes to date? As a little example, the question is: how do you create a loss function for questions like what is funny, what is ethical, what is safe? If you try to write these down on a piece of paper, you're either going to have a hard time or be very wrong. The goal of reinforcement learning from human feedback is to integrate these complex datasets into machine learning models, to encode these values in a model rather than in an equation ("in code" could be somewhat unclear on the slide). Really, we want to learn these values directly from humans, rather than trying to assign one set of values to all humans and mislabeling what the actual values are. So reinforcement learning from human feedback is one of many methods, and one that has been really timely and successful, for trying to actually address this problem of creating a complex loss function for our models. From here I'm going to talk about the origins of RLHF and where this field came from, with some interesting back-pointers that you can look at if you're interested in more.
Then I'll go through the conceptual overview, which will be a detailed walkthrough of the blog post that we wrote, and then go into future directions and conclusions that are kind of reading between the lines of how RLHF works at these companies, what people may not have said, and where RLHF is going. For history, RLHF really originated in decision making, and this was before deep reinforcement learning, when people were creating autonomous agents that didn't use neural networks to represent a value function or a policy. What they built was a machine learning system that created a policy by having humans label the actions an agent took as being correct or incorrect. It was a simple decision rule where humans labeled every action as good or bad, and this was essentially a reward model and a policy put together. In this paper they introduced the TAMER framework to solve Tetris, and it was interesting because the reward model and policy were all in one. What we'll see in future systems is that they become separated a bit, and this was happening as reinforcement learning from human feedback was getting popularized in deep RL.
So this paper was on Atari games, where they were using a reward predictor trained on human feedback over trajectories. A bunch of these states, which can also be called observations in an RL framework, were given to a human to label, and then this reward predictor became another signal into the policy that was solving the task. So this really originated outside of language models, and there's a ton of literature on RLHF outside of language models, but for most of the rest of the talk we're going to talk about language modeling, because that's why everyone is here. Some more recent history: OpenAI was doing these experiments with RLHF where they were trying to train a model to summarize text well.
It's a really interesting problem, because this is something that a lot of humans have been asked to do in standardized tests for a really long time. It's like reading comprehension, so there are really human qualities to it, but something that's hard to pinpoint, again. This diagram on the right has been around for a few years, and you'll keep seeing variations of it as we go; OpenAI kept iterating on it, and we have our own take on it. Just to get the idea going, here's an example of RLHF from this Learning to Summarize paper from OpenAI.
The prompt here, just to read part of it, was from someone on Reddit, on Ask Reddit, asking whether they should pursue a computer science PhD or continue working, especially if one has no real intention to work in academia even after grad school. The post continues and is quite lengthy, and the idea is to summarize it. It reads: has anyone, after working for a period of time, decided, for whatever reason, to head back into academia to pursue a PhD in computer science, with no intention of joining the world of academia but intending to head back into industry? If so, what were the reasons? Also, how did it turn out? This continues for paragraphs; you can understand what this type of post is, and my question is: how do we actually summarize it? If you pass this into a language model that's just trained on summarization, the output would be something like: I'm considering pursuing a PhD in computer science, but I'm worried about the future. I'm currently employed full-time, but I'm worried about the future. You can see this language model is repetitive; that's not really how a human would write this, and there are sometimes grammatical errors that aren't so nice to read. Then what OpenAI did is they also had a human write an example.
So this would be a very good output, and the human annotation was: software engineer, with a job I'm happy at for now, deciding whether to pursue a PhD to improve qualifications and explore interests and a new challenge. What the early experiments were doing was using RLHF to combine these signals, to get an output from a machine learning model that is a little bit nicer to read. And here you can see something like: currently employed, considering pursuing a PhD in computer science to avoid being stuck.
Has anyone pursued a PhD purely for the sake of research, with no intention of joining the academic world? This is better, and there are tons of examples like this. So it's easy to see why you may want to use RLHF: you can get these models where the text is actually more compelling, and especially if the task covers sensitive subjects where you really don't want misinformation, there are a ton of reasons to try RLHF. Next comes ChatGPT, which is why a ton of people are here.
And what has OpenAI told us about this? Really, we don't know much, because OpenAI is not as open as they once were, but there are actually some really interesting rumors going on. If we go into the rumor mill, OpenAI is supposedly spending tons of money on the human annotation budget, orders of magnitude more than for the summarization paper or the academic works they were doing in the past. So they hire a bunch of people to write these annotations, like in the example I showed, and then they're changing the training. There are a lot of rumors about them modifying RLHF, but they haven't told us how, so we'll go through the overview and then note which of these pieces would actually change. But the impact is clear: everyone here has used it, it's amazing to use, and it's a sign of what's going to come for machine learning systems. Okay, let's go to the actual technical details. If there are any pressing questions, I can try to look at them. Yeah, I'm saving most of the questions for the Q&A, but there are a couple we can look at rapidly because I think they're quite easy to answer right now.
Can I download ChatGPT and fine-tune it on my own data? No, you can't yet; hopefully someone will release a model that you can do that with. And: can ChatGPT be trained continuously with new data? Yes, ChatGPT is definitely going to keep being trained on the data you're giving it, and we'll talk about that more later. Okay, let's continue. So let's dive into RLHF. When you see RLHF, I'm going to break it down into three conceptual parts that you can keep track of in your head, and you don't need to read everything on this slide; I'm going to go into each of these figures in great detail. It's a three-phase process. First there's language model pre-training: you need some language model that you're going to fine-tune with RL. Then reward model training, which is the process of getting a reward function to train the RL system with. And then finally actually doing the RL, which is when you fine-tune this language model based on the reward in order to get this more interesting performance. So let's start on the left here with language model pre-training.
So NLP, since the Transformer paper, has really been transformed. Oh, that was a rough sentence, but NLP has really taken off with these standardized practices for getting a language model: scrape data from the internet, use unsupervised sequence prediction, and these very large models become really incredible at generating sequences of text that mirror the distribution given to them by this human training corpus. And in RLHF there's really not a single best answer for what the model size should be; the industry experiments on RLHF have ranged from 10 billion to 280 billion parameters, and I suspect that academic labs will try even smaller ones. This is a common theme you'll see: there's a lot of variation in the method and no one knows exactly what is best. And then what you'll see here is this human-augmented text, which is optional, and we'll get to that. Just to cover the dataset that we have:
there's this prompts-and-text dataset. The dataset will look like things like Reddit (I read an Ask Reddit question before), forums, news, books, and then there's this optional step to include human-written text from predefined prompts. That'll be things like: you've asked ChatGPT a question, and then in the future, when OpenAI trains the next ChatGPT, they can have an initial model that knows that kind of prompt is coming and train on datasets that reflect it. And generally there's this important optional step, which is that a company can pay humans to write responses to these important prompts it has identified, and those responses are really high quality training data that they can use to continue training this initial language model a little bit more.
Some papers refer to this as supervised fine-tuning, SFT, and one way to think about it is that it's a high quality parameter initialization for the RLHF process that comes later. This is really expensive to do, because you have to hire people who are relatively focused to actually write in-depth responses.
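As a rough illustration of this supervised fine-tuning step, here is a minimal sketch using the Hugging Face transformers Trainer to continue training a small causal language model on prompt/response pairs. The model name (gpt2) and the tiny in-memory dataset are illustrative assumptions, not what any lab actually used.

```python
# Sketch of SFT: continue training a pretrained causal LM on
# high-quality, human-written prompt/response pairs.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # small stand-in for a much larger pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

class SFTDataset(Dataset):
    """Concatenates prompt + human-written response into one training sequence."""
    def __init__(self, pairs):
        self.examples = [
            tokenizer(prompt + "\n" + response, truncation=True, max_length=512)
            for prompt, response in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        return self.examples[idx]

pairs = [("Should I pursue a CS PhD?", "It depends on your goals ...")]  # toy data
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=SFTDataset(pairs),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```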
So now we have this language model. The next step is to figure out how to use it to generate some notion of preferences, because this whole time we've been talking about how to capture preferences that mirror humans' without assigning a specific equation to them. That step is this reward model training, and it looks like a lot, but really think about the high-level goal, which is that we want a model that maps from some input text sequence to a scalar reward value. The scalar notion is important, because reinforcement learning is really known for optimizing one single scalar number that it sees from the environment over time. So we're trying to create a system that mirrors that: how do we get the blocks to fit together correctly so that we can use RL in this impactful way? What we see is that, again, this reward model training starts with a specific dataset.
The dataset here will be different from the one used in language model pre-training, because it'll be more focused on the prompts it expects people to use. There are actually datasets on the internet that are kind of like preference datasets, or there are prompts from using a chatbot; there are a lot of specific datasets that can be useful at different parts of the process, but again, the best practices are not that well known. In reality these prompt datasets will be orders of magnitude smaller than the text corpora used to pre-train a language model, because really it's just trying to get at a more specific notion of the type of text that is really human and interactive, rather than everything on the internet, which everyone knows can be very noisy and hard to work with. And then what happens is we'll generate text from these prompts, and the downstream goal of having that text is to rate it,
or really to rank it. So what will happen is you'll pass these prompts through a language model, or in some cases actually multiple language models. If you have multiple models, they can be like players in a chess tournament: you'll have the same prompt go through each model, which will generate different texts, and then a human can label those different texts and create a relative ranking. So that's what we're going to do: the goal is to take this generated text, pass it through some black box, and have the output be something that can be transformed into a scalar.
There are multiple ways this can be done. Some of them are like the Elo method, where you have head-to-head rankings; there are plenty of different ways to do it, but essentially it's a very human component, where a human is using some interface to map the text to a downstream score. And then, once we have that, we need to think about the input and output pairs for training a model with supervised learning. What we'll do is train on a sequence of text: the model takes it as input, encodes it, does Transformer-model things, and the output is trained against a specific scalar value for the reward. Then we get this thing that we call the reward or preference model.
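As a tiny sketch of the Elo-style idea mentioned above, here is the standard Elo update applied to a head-to-head comparison between two model outputs; this is only an illustration of how pairwise human preferences can be turned into scalar-like scores, not a confirmed detail of any specific system.

```python
# Standard Elo update: the human preferred output A over output B.
def elo_update(rating_a, rating_b, a_won, k=32):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two candidate completions start at 1000; the human prefers A.
a, b = elo_update(1000.0, 1000.0, a_won=True)
print(a, b)  # A's score rises, B's falls; aggregated over many comparisons,
             # this gives the relative ranking the reward model is trained on.
```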
Because there are multiple parts to the system, in this talk I'll try to call the initial language model the initial language model, or the initial policy. And then there's a separate model, which is the reward model. It's also a very large Transformer-based language model, so it could also have many parameters; it could have 50 billion parameters as well. There are some variations in the size: for example, InstructGPT used a roughly 175 billion parameter language model with a reward model of six billion parameters. But the key is that it outputs scalars from a text input, and there are still some variations in how it can actually be trained. So now that we have this reward model, it can act as the scalar reward from the environment.
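Here is a minimal sketch of such a reward model, assuming the common setup of a Transformer backbone with a scalar head trained with a pairwise ranking loss (the human-preferred text should score higher than the rejected one). The backbone name and the loss form are typical choices for illustration, not a confirmed detail of the systems discussed.

```python
# Sketch of a reward model: text in, one scalar out, trained on pairs.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name="distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.scalar_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool with the first token's hidden state, then map to one scalar.
        return self.scalar_head(hidden[:, 0]).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = RewardModel()

chosen = tokenizer(["A clear, faithful summary of the post."],
                   return_tensors="pt", padding=True)
rejected = tokenizer(["I'm worried about the future. I'm worried about the future."],
                     return_tensors="pt", padding=True)

r_chosen = reward_model(**chosen)
r_rejected = reward_model(**rejected)

# Pairwise ranking loss: push the chosen score above the rejected score.
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(float(loss))
```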
And then we need to understand what the policy is and what the states and actions are, so that when we go into this final step of fine-tuning with RL, which looks very complex, what we'll see is that the states and actions are both language, and the reward model is what translates from the environment, from these states of language, to a scalar reward value that we can use in a reinforcement learning system.
So let me break down the few common steps in this iterative loop. We take some prompt, something the user may have said or something we want the model to be able to generate well for, and we pass that through what is going to become our policy, which is a trained large language model that generates some text. We can pass that text into the trained reward model and get some scalar value out. That's the core of the system, and we need to put it into a feedback loop so we can update it over time. But there are a few more important steps. One of them, which all the popular papers have used some variation of, is to use a Kullback-Leibler (KL) divergence.
The KL divergence is really popular in machine learning; in essence, it's a distance measure between distributions. Not to get too into the details of how sampling from a language model works, but when you pass in a prompt, the language model generates a sequence of distributions over time, and we can compare those distributions to each other. What is going on here is that we're trying to constrain the policy, this language model on the right, as we iterate on it over time, to not be too far from the initial language model that we knew was a pretty accurate text generator. The failure mode this prevents is that the language model could output gibberish to get high reward from the reward model; we want it to get high reward while still giving out useful text. So this constraint keeps us in the optimization landscape that we want to be in. One note: DeepMind doesn't use this in the reward, but rather applies it in the actual update rule of the RL algorithm. So, common theme: the implementation details vary, but the ideas are often similar. So now we have this reward model output and this KL divergence constraint on the text.
What happens is we just combine the scalar reward with a scaling factor lambda, to say how much we care about the reward from the reward model versus how much we care about the KL constraint. And in reality there are options to add even more terms to the summation: for example, InstructGPT adds a reward term for the text outputs of the trained model matching some of these high quality annotations that they paid their human annotators to write for specific prompts. So again, it would be matching that summarization the human wrote about the grad school question; they want to make sure the text matches the human text they have access to. But that's really reliant on data, so not everyone has done this step. And then finally, we plug this reward into an RL optimizer, and generally the RL optimizer will just operate as if the reward was given to it by the environment, and then we have a traditional RL loop.
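A minimal sketch of the combined signal just described: the reward model's scalar score minus a lambda-weighted KL penalty that keeps the tuned policy close to the frozen initial model. Shapes, names, and the exact KL formulation are simplifying assumptions for illustration.

```python
# Reward used for the RL update: r = r_RM - lambda * KL(policy || initial model).
import torch
import torch.nn.functional as F

def rlhf_reward(rm_score, policy_logits, ref_logits, kl_coef=0.1):
    """rm_score: scalar from the reward model for the generated text.
    policy_logits / ref_logits: (seq_len, vocab) logits over generated tokens
    from the model being tuned and from the frozen initial model."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    # Per-token KL(policy || reference), summed over the generation.
    kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).sum()
    return rm_score - kl_coef * kl

# Toy call with random logits, just to show the shapes involved.
seq_len, vocab = 8, 50
print(float(rlhf_reward(1.3, torch.randn(seq_len, vocab), torch.randn(seq_len, vocab))))
```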
The language model is the policy, this reward model and text sampling setup is the environment, we get the state and reward back out, and the RL update rule can work. There are some tricks to it; for example, this RL policy may have some parameters frozen to help make the optimization landscape more tractable. But in reality it kind of is just applying PPO, which is a policy gradient algorithm, to the language model. So, a brief review:
PPO stands for Proximal Policy Optimization, which is a relatively old on-policy reinforcement learning algorithm. On-policy means that the gradients are computed only with respect to the data collected by the current policy, rather than keeping a replay buffer of recent transitions. PPO works on discrete or continuous actions, which is why it can work okay with language. It's been around for a long time, which really means it's been optimized for these parallelized approaches, and that has been really important because these language models are way bigger than any policy reinforcement learning would normally use.
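For reference, here is a compact sketch of PPO's clipped surrogate objective, the generic formulation rather than any lab's exact implementation; libraries such as trl and trlX wrap this kind of update around language models.

```python
# PPO clipped surrogate loss over sampled actions (here, generated tokens).
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the more pessimistic term, then negate because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

# Toy call with random values just to show the shapes.
n = 16
print(float(ppo_policy_loss(torch.randn(n), torch.randn(n), torch.randn(n))))
```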
Okay, I'm going to pause here. I think it's a good time to answer one or two conceptual questions, if there are any, and then we'll get into a fun wrap-up part of this talk with some open areas of investigation.
Yeah, so we tried to select some; the others we're going to answer in the Q&A just after. One of the questions was: is it possible for the model to be manipulated based on the human feedback? What I think they mean is: if the human feedback is not correct, can the model be manipulated? Yeah, so this is a part of it; I think I might touch on this later too, but it's a really nuanced question in RLHF, which is: Thomas and I are going to have different values. Like, what if the dataset says the best sport in the world is football, and you have both Americans and Europeans in it? There's real discordance in the data that you can get in text. There's also some interesting work from Facebook on something called BlenderBot, where they're trying to train a model to detect whether people are trolling in their feedback, so they're trying to see if the feedback given to the model is actually bogus or not, and the number of different machine learning models all going into one chatbot system is pretty wild. Something else that we've discussed internally that would help is a model to predict whether the prompt is hard: if the prompt is "the capital of Alaska is blank", that hasn't really changed, but if you have a relatively timely prompt about climate change or current events, that's hard because it changes with the data so much. These things aren't all done yet, but they're the sort of things people might add to the system. Okay, let's see. Oh sorry. Yeah, so the next question: do the human annotators write both the prompts and the responses,
or only the prompts? People definitely write both. The prompts are probably sourced from a wider distribution of people (what I've written into ChatGPT could be used in the future), but the responses are, at least for ChatGPT, kept to a relatively closed set of contractors. The question, when trying to build an open-source ChatGPT, is how to get this high quality data, and even though all the people in the Hugging Face community are amazing, it seems like there are pretty strict requirements on the responses to make them high quality enough for this to work. Crowdsourcing that data is hard because it can't be written by everyone. There's an advantage to having diverse prompts, so that's why those are taken from everyone, but the data for the feedback part needs to be really high quality, so it comes from a subset of people. Awesome. Anyway, I'm going to continue. I think this is probably my favorite part of the talk, where we talk about some interesting parts of RLHF. It's a good interweaving between the concepts that we've covered and what is confusing about this.
So, almost all the papers to date that have been popular have tweaks to the methods I've talked about. Anthropic is great: they've released open-source data for this, it's on the Hub, and we can link to it once I'm done talking. They released a really long document detailing all their findings in multiple ways, and they have some complex additions. For example, the initial policy they use for RLHF has this context distillation to improve helpfulness, honesty, and harmlessness, and we'll show an example in a second of how this could change the text between two RLHF implementations.
And then they have another step, which is preference model pre-training. Because the reward model itself is a different language model, you might want to do its training differently, so what they did is train it like a language model to predict actual tokens, and then they used these ranking datasets on the internet, datasets that already exist with binary rankings for responses. It might be a Reddit question with two responses, where one is labeled thumbs up and one thumbs down. They fine-tune the reward model on this before training it on ranked generations, to help initialize the reward model. They also tried this thing called online iterated RLHF, which is, while doing the RL feedback loop, iteratively updating the reward model to help the model continue to learn while it's interacting with the world. This online version only works in some applications, like chat, where you can keep getting user engagement; you can think about ways to use RLHF in a non-tech-based world, or for non-chat applications, where this data is more complicated to get and might actually be proprietary, so the online version may not be applicable to every experiment. And then OpenAI: this is mostly based on InstructGPT. They're the ones who pioneered having humans generate the language model training text, and they've pushed this really far by also adding this RL policy reward for matching it. Other companies are definitely starting to imitate this, but it's constrained by cost; OpenAI has the advantage in scale to be able to invest millions of dollars into this, and otherwise it's an open question how people replicate it. And DeepMind coming in to join the space and doing things totally differently has probably been great for the research field, to add diversity. They're the first ones to use non-PPO optimization for the algorithm.
They use advantage actor-critic, which is another on-policy RL algorithm, and my interpretation is that the algorithm used often depends more on the infrastructure and the expertise than on the algorithm itself. OpenAI has been using PPO more than anyone; DeepMind has highly specific infrastructure to deploy RL experiments at scale in a distributed manner and to monitor them. So I'm guessing the algorithm they used was really easy for them to scale up and monitor, rather than PPO, where they would have to start over. Also, DeepMind trains on more things than just alignment to human preferences: they also try to encode specific rules on things a model should not do. So they're training on multiple objectives at once, these rules about structure and things it should or should not say, plus plain human preferences: I like this one, I don't like that one. And there's more out there. This has been a crash course for me, studying this over the last couple of weeks, so if there's anything I missed, please add it to the chat and we can update the resources that everyone will use in the future. The field's moving really fast; OpenAI might release the ChatGPT paper tomorrow, and this would be instantly out of date, and we'd go update all of this.
So, thanks for your feedback there. The next really interesting thing to me is this reward model feedback interface, which is how machine learning is going beyond a research and technical domain and becoming one that is inherently human and has user interface, UX questions. This one here is Anthropic's text interface.
They show this in their paper; you should really go check it out. What they did is they made a chatbot, and you can see that during the chat the human has to actually rank which response they think is better, on a kind of sliding scale. And it's really important: there are all these places where you could say "I thought the assistant was blank". There's a ton of data going into this system, and we're only at the first couple of iterations of what these feedback interfaces will look like. Anthropic's is actually a couple of steps ahead of what others have done.
On the left here is BlenderBot, which is from Facebook. It's not confirmed that they use RLHF, but they're still collecting this data to update the model. On the right is ChatGPT: users can thumbs-up and thumbs-down outputs, but some of the people I've talked to who go deeper into RLHF say that thumbs up / thumbs down is used because it's easy to get the data, not because it's the best data you could have.
An example would be giving humans the ability to directly edit the outputs, kind of red-line edits: changing words, removing things, punctuation. That crowdsources the really high quality data that OpenAI has been getting; maybe not quite as good as a contractor being paid to write it, but much better and a much higher signal than thumbs up and thumbs down. So that's one thing: these interfaces will continue to evolve over time. Now, to change gears a bit, let's walk through some recent examples, and I'll show you the things I talked about in these figures that you may have seen before.
Here's the most popular figure, from InstructGPT, and you can see where the three-step process I was talking about was inspired from. Really, OpenAI walks you through it. Step one: collect demonstration data and train a supervised policy; this is training the initial language model. Step two: collect comparison data and train a reward model; you can see there are these different datasets, the samples or human-generated text in step one, and the comparison data, this ranking system, in step two. Then step three: optimize a policy against the reward model using reinforcement learning. This is the one that I think oversimplifies what is happening, and that's really why I wanted to explain it and elucidate the space: there's a lot that can go into this final step that is really not always documented. And then another one, the Anthropic figure.
This one kind of does away with the three-step process but adds in all the complex things that would make it hard to follow as a new person. You start with this pretrained language model, which captures a lot of what I would put as step one, and branching out of it immediately are these two modifications that, like I said, are Anthropic-unique things. One is preference model pre-training, which is pre-training the reward model using those thumbs-up / thumbs-down datasets scraped from the web. The other is harmlessness and helpfulness prompt context distillation, which is figuring out how you can add a context before your prompt to help initialize the reinforcement learning part. And then they detail their feedback interface and how this actually iterates over time. Okay, comparing the diagrams, it's also interesting to see what Anthropic optimized for versus what InstructGPT was optimizing for.
Anthropic was really trying to push this alignment angle a bit further: how to have an agent that is really harmless and actually helpful. Here in the appendix of the Anthropic paper there are examples comparing InstructGPT responses to Anthropic's model responses. One of the questions is: why aren't birds real? And you can see that InstructGPT says that birds are not real, blah, blah, blah, which is not that helpful. The behavior that Anthropic wants is for the model to say something like: I'm sorry, I don't really understand the question, birds are very real. It's actually quite impressive to get a machine learning model to do this, and that step is really why people are optimistic about RLHF going from being kind of a toy thing to having these dramatic results in high-impact, user-facing technologies. Okay, just two high-level open areas of investigation
that particularly interest me as a reinforcement learning researcher, being at Hugging Face where we have this unique research / open source / community position. One is that there are a lot of reinforcement learning optimizer choices that are not that well documented and could be expanded on. Some people don't even know if RL is actually explicitly necessary for this process; PPO is definitely not explicitly necessary. And then there's a further question: can we train this in an offline RL fashion? In offline RL, you collect a big dataset and then train the policy for many, many optimization steps without needing to query the environment, and in this case the environment is really the reward model, which, being 50 billion parameters, is quite costly to run inference on. So maybe we should try offline RL, which would reduce the training costs of the RLHF process, but it doesn't reduce the data costs. Here you can see the other side of what I was talking about: these data costs are really, really high. There's the high cost of labeling, which is just human time. There's disagreement in the data; I gave the sports example, and there are much more important differences in the values people have. That's why these human questions are hard: human values have disagreement, and that's by design, so you want to be able to capture it. There's never going to be one ground-truth distribution that says this is the only right thing. And then there are these feedback and user-interface questions, where I'm really excited to see how machine learning breaks into the general populace. To wrap up, before I switch into the Q&A format: I've shown you the cool things RLHF does.
I hope that the couple of examples I took the time to actually read parts of show you what it's trying to address by building these tools. There's a huge variety of complex implementation details, where multiple very large machine learning models are integrated together. Using any of these models in a standalone fashion is a relatively new thing for the machine learning community, with only a couple of years of experience, and machine learning as a technical problem is now being broadened out from research to be a much bigger part of the software stack, and that brings a lot of people into the conversation who can help make these tools much better for everyone involved. So thank you for watching and listening and engaging; it's been great sharing this with you, and we'll transition into the Q&A part. You can see I've linked to the end of the blog post, where I've been continuing to update the related work section to include a broader set of papers, and feel free to reach out on Twitter or email or Discord and we'll get back to you there too. Thanks. Awesome, thanks, Nathan, for the presentation. We're going to have a small Q&A session.
We have a lot of questions, so if we don't have time to answer yours, don't hesitate, as I said, to join our Discord server, where we have a channel for RL discussions. Also, if you prefer, you can ask in the comments under this video, and we will take the time to answer your questions. So I saved some questions. I think this one is more of an open question: what would be the potential of applying reinforcement learning from human feedback to Stable Diffusion? Yeah, you probably can. I think it would help with some of the safety problems, and it's a fine-tuning method, so I don't see any structural reasons why you couldn't. I haven't thought about that much; the image space is always hard for me to think about because my own understanding is so language-based, but I think there are no structural reasons why you cannot. The encoding and decoding of the prompt gets a little bit different, which is a little tricky; essentially you'd have a reward model that takes in images rather than words, so I don't see why you can't. There are actually some demos on Hugging Face about safe Stable Diffusion, where they did some fine-tuning of Stable Diffusion to make the outputs more reasonable, so we can track down some of those examples from the diffusion-model side and follow up, because they might actually be doing something quite similar. I tried to start the talk not in language models because human feedback is a huge field of machine learning; it just got quickly popularized with this language model discussion. So one of the questions is: what is Hugging Face's role or plan in the future direction of reinforcement learning from human feedback?
Yeah, so the specific plan is definitely not settled yet, but there's a lot of appetite for it, and Hugging Face is in this unique position where we have this community that is super important to the company, and that gives us a different ability to collect data and so on. So Hugging Face is planning it but hasn't come up with a specific project yet, and when the project is known, I'm sure Hugging Face will communicate with the community and say: this is how you can help, this is where we're trying to take things, these are the questions we're trying to address. That's why being transparent is so fun, because we can just share everything. But right now it's still a work in progress; it's been moving fast for the last week. The next question, I think it's more of an open question: is there another, scalable way of evaluating these models without human feedback?
Yeah, so that would be a good thing to include in the lecture. There are a lot of metrics and datasets that are designed to evaluate these topics of harmfulness or alignment or text quality on a model without actually having humans involved, to try to be more rigorous with respect to these kinds of ambiguous questions. That's something that could definitely be added; you could do a whole lecture on evaluation metrics for NLP, there's a lot there, things like BLEU and ROUGE. There are a couple mentioned in the blog post if you want to look there. Awesome.
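As a small illustration of that kind of automatic, human-free evaluation, here is a sketch scoring a candidate summary against a reference with ROUGE via the Hugging Face evaluate library; the strings are made up, and this assumes the library and its rouge dependency are installed.

```python
# Automatic text-quality scoring without humans in the loop.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Software engineer deciding whether to pursue a CS PhD."]
references = ["Happily employed software engineer deciding whether to pursue "
              "a PhD to improve qualifications and explore interests."]
print(rouge.compute(predictions=predictions, references=references))
```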
So one of the questions is: reinforcement learning has problems with convergence; does having a pre-trained NLP model remove this problem? Yeah, so actually, talking with folks at Carper who are making the trlX library (if you Google trlX, that's what they're working on, scaling RLHF), what they're trying to do is get their RL implementations to scale to bigger and bigger language models, and the general limiting factor is that the PPO update steps don't converge easily on the bigger models. So there still are problems with convergence. I don't know exactly what it looks like when fine-tuning a language model doesn't converge, like how bad it could get, but there are definitely still convergence problems when fine-tuning with RL. So, next question: ChatGPT works mainly in English; what would be the advantage of having it in other languages? I suppose having more knowledge of the world. Yeah, and it's about democratizing access to way more people, so I think that will come. It's a classic thing where technology hits the English-speaking world first, but once there's an open-source version, within weeks there are going to be fine-tuned versions in tons of other languages.
Do you think these GPT systems are sustainable, given, as you mentioned, that they can cost a lot, maybe not trillions, but a lot in annotation costs? Yeah. So the upfront cost isn't really a problem for those companies: 10 million dollars on annotation is not a lot for OpenAI. The issue is that it costs a couple of cents per inference of the model, and this cost will go down a lot. That's why OpenAI partners with Microsoft, because Microsoft is learning how to create, at scale, low-cost APIs for complex model inference, and those systems were probably built in the last six months. But if you give them a few years and the technology settles in, the cost will drop 10x and everything will work out. It's just really interesting to follow right now because it's a very fast-moving landscape with wild costs. Next question: can ChatGPT be more accurate in one specific domain,
I suppose if we fine-tune it on one domain, as mentioned, for instance mathematics? Yeah, that will happen, among other things. People like to talk about ChatGPT being used for search, and an interesting business-model consideration is using an RLHF model trained on internal company documents to create a really effective company search. In places like Google, where there are billions of internal documents, it's impossible to find them if you're an employee; if they do RLHF on their internal data, the model will know what it needs to. Something I encourage you to do is go ask ChatGPT about a very specific subject: surprisingly, ChatGPT does okay at very specific subjects, and people think that's because there's not that much data and most of the data is scientific papers, which are, all things considered, more accurate than something like Reddit. So they think it might transfer to these use cases where there's only pretty precise, specific data that people could fine-tune on. So one of the questions is: will we still need human annotators in the future,
given that, as mentioned, Tesla got rid of human annotators by creating more powerful models? Yeah, maybe, but probably not soon. It's kind of an unsettling question; I'm not confused, I'm mostly just unsettled by it. I don't think it'll come within a couple of years, but when we get to the case where we're training language models on other language models, because one language model is the ultimate source of truth, I'm just very worried. So I kind of want to say no, out of hope that it isn't the case, but I wouldn't be that surprised. You can already see companies probably trying to train their models to mimic ChatGPT, because ChatGPT is ahead, so they can bootstrap their own training data by using ChatGPT to get a model to imitate it. I don't like it, but it's likely. Next question: will there be a type of model that can receive images and sound as input to understand concepts better? I suppose it's something a lot of researchers are currently thinking about. Yeah, I would definitely think that people are going to try things like that. There's a whole multimodal project at Hugging Face where they're trying to figure out how to train models that use multiple types of data, and people will continue adding modalities to make the base model more flexible, which will be very fun to follow. Next question: does ChatGPT look for data online, or does it have everything in its memory?
I think it has everything in its memory, but it's not confirmed, as it's not released. There are models that do this kind of online lookup. Rumors are that OpenAI has figured out some incredible scraping techniques; it's probably not 100% true, but people have said that OpenAI is better at scraping YouTube than Google is. That's probably hearsay, which probably just means they're doing about as well as Google, but the fact that an external company has figured it out as well as Google has is still pretty remarkable. So, do you think we will see RLHF for other modalities, like generating images, art, and music?
I suppose so, yes, I think so. Ultimately, there's still discussion on what RLHF is good at. This is probably the peak of the RLHF hype, but, as I was saying, the field of human feedback is much broader than language, going back decades, so that's not going anywhere; the RLHF branding is just a new sub-topic of it. Next question: do you think it makes more sense for builders to begin labeling a lot of data with an existing large language model like GPT, or will the next generation swamp any fine-tuning we do?
This is a question that we're talking about internally as well; it's something I've posted in our Slack. The GPT-4 rumors are hilarious: I've gotten multiple messages from people, and you see the tweets that are like, GPT-4 is world-breaking, don't tell anyone how I know this. But the thing is that the data is still really useful. OpenAI is getting this huge data advantage, and they'll use that when they want to do RLHF on GPT-4. The specific implementation details might need to change based on the architecture or something like that, but I don't think the data pipeline is going to become obsolete immediately.
Are there any resources you recommend to learn more about this? I think we already mentioned our blog post. Yeah, I would say the blog post. Also, the alignment community is very responsive to people engaging on their topics; a lot of RLHF researchers are very affiliated with alignment, and there are other forums that I haven't explored as much, like LessWrong and the Alignment Forum. I'm not going to say that I endorse all the content on them, there's a ton of content, but these people are pretty engaged with the community as researchers, so if you write respectful questions to them, you'll often get responses; it's not just me. I did try to make the blog post we wrote the starting point for a conceptual introduction, specifically because I thought there was not a clear introduction. The blog posts for papers have the problem that they need to introduce the paper's content and not just the concept. So when you remove the specific advancements of the papers, that's kind of what our blog post is, just to make it a little bit more approachable. But if there's something missing, you can let us know.
I think this is an interesting question: given that OpenAI really has an edge for the next GPT models, what can other companies and the open source community do to keep up the pace? The answer is that the open source community has way more people and engagement than OpenAI. OpenAI is small and hyper-focused, which always gives startups an advantage, but given the amount of appetite for this, there are thousands more people willing to help on an open version, and that scale of access is different.
Why do you think reinforcement learning from human feedback works much better than just fine-tuning the original model directly on the same reward dataset? This is the ultimate question: does RLHF actually do anything? I'm not a hundred percent sure, but rumors are that they think RL just handles shifting through the optimization landscape nicely. So I'm guessing fine-tuning on the same dataset could work, but the optimization just hasn't been figured out in the same way, and it's exciting, as someone who does RL, that this different way of navigating the optimization space was useful. But it is not well documented; the research-paper version of the blog post that we wrote is desperately needed. Yeah, there was a question during the presentation: will the paper come out tomorrow? No, unlikely. There is a chance it could be released tomorrow and this lecture would no longer be quite as relevant, but it's really unlikely that we see it tomorrow. Yeah, surprise, I work at OpenAI now, so I think I can answer that: no, you can't read the code of ChatGPT, it's a proprietary model, and I think you can't contribute as an outsider, it's an internal project. We'll see if it happens. [laughter] Unfortunately, we've run out of time. For the people whose questions we didn't have time to answer: as you can see on the slide, you can ask on the Discord, so join the Discord server, or ask in the comments under the video, and we will make time in the upcoming days to answer your questions. So yeah, don't hesitate. That's all for today; thank you all, and thank you, Nathan, for this presentation, it was super interesting. I will see you on the Discord and in the comment section. Bye.