Open Pretrained Transformers - Susan Zhang | Stanford MLSys #77

Okay, hello everyone and welcome to episode 77 of the Stanford MLSys seminar series. This quarter, of course, we're also partnered with CS324, Advances in Foundation Models. Today I'm joined by Percy and Avanika, and our guest is Susan from Meta. Susan has a very exciting talk prepared for us today: she's going to be talking about the trials and tribulations of training OPT-175B, so we're very excited to welcome her. Of course, if you have questions, feel free to post them in the YouTube chat or in the Discord channel if you're taking the class. And with that, Susan, take it away. Thanks, Dan. Let me share my screen. Oh, where did that go? Well, that's unfortunate. One second. Ah, while we're waiting for that screen share, Michael has also shown up.

Say hi, Michael. Hi, Susan. Cool, can you guys see my screen? Yes, looks good, take it away. Awesome, cool. So, hey everyone, I'll be talking about the process of developing our 175-billion-parameter OPT model. This was a model we released around May of last year, along with a suite of smaller baseline models that went with it. I'll pretty much be covering the entire journey of training this model.

I'm starting from September all the way through around the end of December 2021. The setup here was a group of five engineers tasked with training this 175-billion-parameter model in about three months. We were given a thousand A100 GPUs, the shiniest ones we could get our hands on at the time, and with the training efficiency we had, we needed about 33 days of continuous training to make our way through 300 billion tokens for the 175B model. The unfortunate bit was that we didn't really have a full-on infrastructure or systems team to support us, outside of a customer support team from our cloud provider. There was also the fact that we wanted to open-source these models, so all of our infra was decoupled from the internal production systems. The data we had was whatever we had lying around the lab at the time. Looking back, it would have been nice to spend more time on the data side; given what we've seen since then with Chinchilla and whatnot, it's very clear we were undertraining these models. But to test our infrastructure and get a feel for what training these models would be like, we kind of had to just start. The other difficulty was that the hyperparameters we were used to seeing within FAIR were very different from what was published by Microsoft and NVIDIA with the Megatron-Turing paper, along with what OpenAI released with the GPT-3 lineage. So it was very unclear what settings worked best, given that we also didn't have the training codebases for these other projects. So in October I kicked off our initial run using the existing setup we had from different NLP groups within FAIR.

The reason for choosing that was that there was an existing empirical history for how those settings should work. The sentiment was that, given we'd seen how these settings worked at smaller scales, up until I think 13 billion parameters, hopefully they would just transfer up to 175 billion. It was very clear that didn't work out well. The first thing we changed was weight decay. Initially, most of the papers written from FAIR used a weight decay of 0.01, but we noted that, starting from GPT-3 onwards, a lot of folks were using 0.1. That didn't work out well; loss plateaued. Then we started shifting more hyperparameters: we changed gradient norm clipping, we changed Adam beta2, we also changed Adam epsilon. But none of these things mattered, because in the end it turned out we had a bug in our code. This is like AI practitioner 101: you should be testing your code at smaller scales. Unfortunately, in the rush to get started, we did not do that. When we noticed that training a 125-million-parameter model with our codebase failed to converge, clearly something was going on. So while we were fixing that code, the thinking was we might as well use this compute to ablate some hyperparameter settings.

So we started rolling back some of our changes: going back to the original weight decay value, testing whether we still needed gradient clipping, increasing warmup, so on and so forth. By the time our codebase for the tensor parallelism stuff was fixed, we had at least settled on keeping clipping around 1.0. That was run 6. But that still didn't work out. So we kept going through a bunch of changes, and eventually we even tried doubling the batch size. The thinking there was that we'd noticed, in some of our smaller runs, that when we increased the batch size, the variance of perplexity or loss between steps goes down. But there wasn't much of a noticeable difference in our run at the time, so the thinking was we might as well keep the smaller batch size and make sure our sample efficiency with each batch was still reasonable.

So at this point it was already early November, around November 11th or so. We were looking at the clock: one month out of the three months we had was already gone. So we settled on a configuration that we thought could work out well, that we were fairly confident in. We kept a two-million-token batch size. We kept our Adam states in fp32. At the time we were training this run with fp16; looking back, bfloat16 probably would have worked out better, but the lineage of models we had were all fp16 at the time. We used 8-way tensor parallelism. We also did a bunch of data ablations. One thing I haven't covered yet is the artifacts of the dataset that we noticed were problematic. What happened was that in our data deduplication pipeline we ended up adding an extra escape character. So some of our runs started having a very low loss, and we got really excited until we realized the model was just learning to predict escape characters. That wasn't great. After fixing our dataset, we resumed training with the latest corpus. We also changed our positional embeddings to be learned, but we weren't really confident in using learned positional embeddings either, so we kind of split the difference and used learned positional embeddings with sinusoidal initialization, to match the original Transformer paper. Same goes with splitting the difference on weight decay: instead of 0.1 we went with 0.05, since we weren't sure between 0.01 and 0.1. Then there was learning rate. We erred on the side of going with a pretty high learning rate: GPT-3 used 6e-5, we started with 3e-4. The thinking there was also that at some point we'd probably end up having to drop the learning rate no matter what we did, and there's a limit to how far that can go, so we might as well start with the highest learning rate we could and then drop whenever things became unstable. There were a few more changes, like not having dropout and adding in NormFormer, since that was shown at the time to be quite promising at small scale.
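
As a rough illustration of the "split the difference" choice on positional embeddings, here is a minimal PyTorch sketch of a learned embedding table that starts from the sinusoidal values in the original Transformer paper; the class name and sizes are illustrative, not taken from the OPT codebase.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_table(num_positions: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional table from the original Transformer paper (dim assumed even)."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    table = torch.zeros(num_positions, dim)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table

class LearnedSinusoidalPositionalEmbedding(nn.Embedding):
    """Learned positional embeddings initialized to sinusoidal values, so training
    starts from the fixed-embedding behavior but is free to adapt from there."""
    def __init__(self, num_positions: int, dim: int):
        super().__init__(num_positions, dim)
        with torch.no_grad():
            self.weight.copy_(sinusoidal_table(num_positions, dim))

# usage (sizes illustrative): pos_emb = LearnedSinusoidalPositionalEmbedding(2048, 1024)
```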

On the architecture side, NormFormer was just adding a bunch of layer norms everywhere. We also had a gradient pre-divide factor. That was an artifact of how we were implementing local gradient accumulation; it made no difference numerically, it was just there to avoid overflow issues, so we picked a number that seemed fairly reasonable. And with gradient clipping, we started with 2.5, just to be rather generous, since we also weren't sure whether clipping was necessary, which we later found to be quite important.
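
For intuition, here is a hedged sketch of what a gradient pre-divide factor does during gradient averaging: splitting the division by world size into two smaller divisions so the fp16 sum never gets large enough to overflow. It shows the general idea only; the factor value and where it is applied in the real training code may differ.

```python
import math
import torch
import torch.distributed as dist

def allreduce_mean_with_predivide(grad: torch.Tensor, predivide_factor: float) -> torch.Tensor:
    """Average a gradient across ranks without letting the fp16 sum overflow.
    Assumes torch.distributed is already initialized."""
    world_size = dist.get_world_size()
    grad = grad / predivide_factor                  # pre-divide before communication
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)     # the summed values stay smaller
    grad = grad / (world_size / predivide_factor)   # post-divide the remaining factor
    return grad

# e.g. predivide_factor = math.sqrt(world_size) splits the division evenly;
# the result is the same mean either way, only the intermediates change.
```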

Here we also started indexing our runs, knowing that we'd probably have a bunch of restarts, so now the run numbers go to decimals. For the first five restarts, things were already pretty unstable very, very early on; I think this is showing the first 1.5K steps. To start, it was very clear the learning rate was probably too high, so we lowered it from 3e-4 to 7.5e-5. That's a pretty big drop, but it was still unstable: you see that massive spike near the end of the yellow run. So then we dropped gradient clipping to 1.5. That's the purple run, which lasted for some time.

And then we finally hit our first major hardware failure, the famous uncorrectable ECC error. So we just restarted the run; not much changes with hardware failures. We also tried to speed things up by reducing the frequency of validation, and that accounted for the purple restart after the gray one, and you see that loss also exploded in our face. So we lowered clipping again to 1.0, and that's the green run near the end. This is still only about 4,500 steps in; we had about 140,000 to go. So very, very early on in training, things were already unstable. Now things get even worse somehow. For the next five restarts, I'll zoom in here; this is covering steps 4,000 to 6,500. Our loss curve, or perplexity, looks something like this, and there's this bit in the middle which I'll go into.

Our first restart here was the light blue one; this is run 11.6. To remediate that failure, we started skipping batches when the gradient norm was much higher than 1.0. The thinking here was that maybe clipping was distorting our Adam state, so maybe it was better to just skip those batches. We also saw this mentioned later in the PaLM paper from Google, where they do batch skipping to get around instabilities. That didn't work out too well. So then we changed to a new shard of data, thinking that maybe if we just changed the dataset entirely we could get through some instabilities. We also increased weight decay and lowered beta2. That's this first pink wiggle, which didn't last very long either, and so we rolled back some changes.
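
A minimal sketch of the batch-skipping idea described here: drop the optimizer update entirely when the gradient norm spikes, rather than clipping a distorted gradient into the Adam state. The threshold and the loss-function interface are illustrative assumptions.

```python
import torch

def train_step_with_skip(model, optimizer, loss_fn, inputs, targets,
                         grad_norm_threshold: float = 1.0):
    """Run one training step, but discard the batch when its gradient norm is
    far above the usual range (or not finite)."""
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # max_norm=inf means this call only measures the total norm; it never clips.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                               max_norm=float("inf"))
    if not torch.isfinite(grad_norm) or grad_norm > grad_norm_threshold:
        optimizer.zero_grad(set_to_none=True)   # skip: this batch never touches Adam
        return loss.detach(), True
    optimizer.step()
    return loss.detach(), False
```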

And so you'll notice this pattern where we make a lot of changes and then start rolling back, kind of bisecting a bunch of things at the same time. Once again, this goes back to being compute-constrained: even though we were using a lot of compute, we didn't have that much extra to run a bunch of ablations, so we had to make as many changes as we thought could stabilize a run and then roll back when things didn't look good. So here we kept beta2 at 0.95, rolled back weight decay, and removed clipping. That lasted a little bit longer, the yellow versus the previous pink wiggle, but not by much. Then we added back clipping and lowered the learning rate again, trying to match a previous run. That lasted a little longer than the yellow, but once again not by much. At this point we noticed, in the open-source community with the BigScience workshop, that there was this pull request from Stas, and it's quite interesting.

It's a very simple change: instead of multiplying by a factor of n in one go, you multiply by the square root of n twice. The thinking is that this improves numerical stability for large n. So we added this change in and launched our 11.10 run.
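
To illustrate the kind of change being described, applying a factor as two square-root factors so that fp16 intermediates stay smaller, here is a generic sketch; it captures the idea only, not the exact diff from that pull request.

```python
import math
import torch

def scale_after(a: torch.Tensor, b: torch.Tensor, n: float) -> torch.Tensor:
    # Factor applied once, after the matmul: the intermediate a @ b can
    # overflow in fp16 before the scaling ever happens.
    return (a @ b) * n

def scale_split(a: torch.Tensor, b: torch.Tensor, n: float) -> torch.Tensor:
    # Same result in exact arithmetic, but sqrt(n) is folded into each operand,
    # so every intermediate stays closer to the fp16 dynamic range.
    r = math.sqrt(n)
    return (a * r) @ (b * r)
```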

That's the pink wiggle over here. Another change we made to try to improve numerical stability was to remove GeLU and use ReLU instead. The thinking here was that the x-cubed term in GeLU might be causing instabilities as well. So we put these two changes together, since they're both aimed at improving numerical stability. At the same time we also started doing some other ablations on the side. None of these really panned out. So at this point, for the pink run, even though training continued, it wasn't clear that it was actually training anything. One other thing we look at is this loss scalar term.
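
For reference, the x-cubed term in question appears in the common tanh approximation of GeLU; here's a quick sketch contrasting it with ReLU (illustrative, not the exact activation code from the run):

```python
import math
import torch

def gelu_tanh_approx(x: torch.Tensor) -> torch.Tensor:
    # The tanh approximation of GeLU contains an x**3 term, which can produce
    # very large intermediates for large activations under fp16.
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

def relu(x: torch.Tensor) -> torch.Tensor:
    # ReLU has no cubic term; its output never exceeds its input in magnitude.
    return torch.clamp(x, min=0.0)
```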

So with fp16 training, one of the things we apply is this loss-scaling logic, whereas with bfloat16 what we've actually noticed is that we can just train through some of these instabilities.

But with fp16 you'll hit overflow and underflow issues. So to get around potential small-gradient problems, which can cause training to grind to a halt, we multiply the loss by this loss scale factor to amplify gradients in cases where we have small gradients. It's pretty simple: the logic is to scale up the loss when we haven't overflowed in a while, and scale it down when we overflow. When training grinds to a halt, the loss scale factor pretty much goes to zero and training stops. So in the case of this pink wiggle, even though things looked like they were progressing, it was clear that the loss scale was so low that we weren't actually training much at all.
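
A minimal sketch of that dynamic loss-scaling logic; the constants below are typical defaults for this kind of scaler, not the values used for OPT.

```python
class DynamicLossScaler:
    """Amplify the loss before backward so small fp16 gradients don't underflow;
    back off the scale whenever gradients overflow (inf/nan)."""

    def __init__(self, init_scale=2.0**15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        return loss * self.scale

    def update(self, found_overflow: bool):
        if found_overflow:
            # Overflow: shrink the scale; the step that overflowed is skipped.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            # No overflow for a while: try a larger scale to recover precision.
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= self.growth_factor

# Typical loop: backward on scale_loss(loss), unscale gradients, check for inf/nan,
# call update(found_overflow), and only step the optimizer when the gradients are clean.
# A scale stuck near zero is the "training has ground to a halt" symptom described above.
```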

So at this point, with all the changes we'd applied in the 11.x lineage, it wasn't clear that any of them would actually ensure the model would converge. I think this was mid-November by now, November 16th. Once again, we needed 33 days to finish training, and now we were using 992 GPUs instead of 1024, since we'd wised up a bit after a few rounds of hardware failures and realized we had to have machines sitting idle so we could swap them in when hardware failures occurred. And instead of just going through this series of ablations, changing things one by one, we decided to go completely off the cuff and match exactly the GPT-3 or Megatron settings that were published. These settings were consistent with one another, which gave us a bit of confidence in them working at scale, and obviously they were supposedly used to train the GPT-3 model and the 530-billion-parameter Megatron-Turing NLG model. So hopefully we could use them too, even though we didn't have the exact training codebases they used.

Oh, there were some other things we changed, like initialization; this matched the Megatron codebase settings. We took out the layer norms from NormFormer, going back to exactly the GPT-3 architecture. We also had an embedding scaling term that ensured the embeddings would not be initialized to a Gaussian with a standard deviation of one. The thinking was that if you look at the GPT-2 initialization, they initialize with a standard deviation of 0.02 for token embeddings and 0.01 for positional embeddings, if I'm not mistaken, so maybe a smaller standard deviation here would help improve stability. And we also matched weight decay, clipping, et cetera. So here is where we actually got a welcome change, which is all of the hardware failures we started seeing throughout our runs.
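
A hedged sketch of what initializing with a small standard deviation looks like in PyTorch, using the GPT-2-style value mentioned above (0.02; positional embeddings could use 0.01 instead); the exact per-layer scheme in the real codebases may differ.

```python
import torch.nn as nn

def init_weights_small_std(module: nn.Module) -> None:
    """Initialize embeddings and linear layers with a small standard deviation
    instead of unit-scale Gaussians."""
    if isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: model.apply(init_weights_small_std)
```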

So this is just the first 15 restarts. Two runs failed with lost GPUs, six runs failed with CUDA errors, two runs failed with jobs just hanging without making progress. We'd have NCCL errors, and sometimes training would just slow down, and it was unexplained why that slowdown happened. We also had our own set of challenges with the logic we implemented for checkpointing and storing intermediate state, along with non-determinism in how we recovered the loss scalar. That bug actually ended up being a saving grace in some cases, where the non-determinism allowed us to train through some failures when we simply restarted and the state was reset. In general you aim for full determinism; you really want exact numerical equivalence when restarting from a previous checkpoint. But in this case it turned out to help us later on.

So at some point, of course, training became unstable. By now we'd made it past the first 6,000 steps at least, and we were about 24,000 steps in. The metrics we look at the most are roughly these four. On the top left here is the activation norm, the last layer's activation norm going into the softmax. When that explodes, we usually see loss explode as well. To the right are the gradient norms, same thing; these are all highly correlated values. When things are stable, for this run, the gradient norm was usually around 0.2. When things explode, they really explode vertically. Obviously the loss, or perplexity, is probably the most important metric we look at, and then of course, as mentioned before, the loss scalar as well. So you see that the activation norms, gradient norms, and loss scalar are all highly correlated with one another; when things explode or crash, they all mirror one another.

For run 12.16, we still had a bunch of instabilities, so we reduced clipping to 0.3. The reason that value was chosen is, going back to the previous graph, when things are stable the gradient norm is very close to 0.2, so we decided to just clip aggressively whenever things became unstable, thinking we could preempt some of these instabilities by doing so. We also had a backup plan of resetting the Adam state in case the first and second moments were somehow corrupted or drifting into some weird state, but we never ended up actually going through with that. The next 17 restarts were mostly systems issues: more ECC errors, lost GPUs, high DRAM correctable error counts, et cetera. And we also ended up exploiting the loss scaling and the non-determinism with restarts. So in general, the mixture of hardware issues and numerical convergence issues plagued a lot of our largest runs. bfloat16 does help, but once again, at the time we were only using fp16 training.

One thing we did that was kind of a bad idea in retrospect, but we figured we might as well try since things were unstable, was to just change the optimizer entirely mid-training. We tried to be clever by approximating SGD through the Adam hyperparameters: we set beta1 to 0, increased epsilon to some large number, say 100, and increased the learning rate by the same factor, which is effectively equivalent to SGD. Unfortunately, in the code, when we reloaded these states, beta1 was not actually clobbered the way we thought it was being clobbered, so we ended up not really running the fake SGD we thought we were running. So we actually swapped out the code at that point for run 12.41, but we still had a bug in how we were calculating weight decay, and the learning rate was also not adjusted for SGD. The thinking was that, I think there was a paper that came out, I forget which one, suggesting SGD can tolerate quite a bit higher learning rate, so we didn't really address that while we were testing this out. This didn't go very far.
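
Here is the arithmetic of the trick she describes: with beta1 set to 0 the first moment is just the current gradient, and when epsilon dominates the square root of the second moment, Adam's update lr * m / (sqrt(v) + eps) collapses to roughly (lr / eps) * g, so scaling the learning rate up by eps recovers an SGD-sized step. The sketch below is that approximation in plain PyTorch, not the OPT code (which, as she notes, had bugs in exactly this path).

```python
import torch

def adam_as_approximate_sgd(params, sgd_lr: float, eps: float = 100.0):
    """Configure Adam so its update approximately reduces to plain SGD with
    step size sgd_lr (valid while eps dominates sqrt(v))."""
    return torch.optim.Adam(
        params,
        lr=sgd_lr * eps,      # compensate for the division by eps
        betas=(0.0, 0.95),    # beta1 = 0: no momentum on the gradient
        eps=eps,              # large eps swamps sqrt(v) in the denominator
    )
```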

So at this point, instead of doing anything exotic, we had to just do the honest thing, which was to lower the learning rate repeatedly. Our final learning rate schedule, if you squint, looks a little bit inverse-square-rooty or cosine-y or whatever you want to call it, but pretty much it starts pretty high and then keeps getting lowered. There was a brief moment in between, around step 60K, where we thought, oh, things aren't really unstable anymore, we can probably increase the learning rate now. That didn't last too long, and then we just kept lowering it two more times before we finished around step 140K.

This is just another view: when we're training these models we kind of just stare at TensorBoard all day, and the main graph we look at, at the top, is the loss curve. When things become unstable, you really do see the gradient norms start to drift high up; that's the bottom-left plot. And same with the activation norm: you see that reversal in the trajectory where, instead of going down, it suddenly starts going up. These kinds of anomalies signal to us that something is going wrong, and usually we'll have to come in and change a hyperparameter to try to get through it.
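
As a small illustration of watching for that upward drift programmatically rather than purely by eye, here is a sketch of a rolling-window check on a metric like the gradient norm; the window size and ratio are arbitrary assumptions.

```python
from collections import deque

class NormDriftMonitor:
    """Flag when a metric that should be flat or decaying (gradient norm,
    activation norm) starts drifting upward: the recent window's average
    exceeds the previous window's average by a chosen ratio."""

    def __init__(self, window: int = 200, ratio: float = 1.5):
        self.window = window
        self.ratio = ratio
        self.history = deque(maxlen=2 * window)

    def update(self, value: float) -> bool:
        self.history.append(value)
        if len(self.history) < 2 * self.window:
            return False
        values = list(self.history)
        old_avg = sum(values[: self.window]) / self.window
        recent_avg = sum(values[self.window :]) / self.window
        return recent_avg > self.ratio * old_avg

# usage: if monitor.update(grad_norm): print("gradient norm drifting upward")
```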

This is showing the impact of changing just the learning rate. Between the light blue and the green plot, all we did was lower the learning rate, and all of a sudden you see the gradient norm, the middle graph on the first row, go back down; it doesn't spike anymore. The loss scale looks a little more sane, and the activation norm keeps going down. So the learning rate alone seems to make quite a significant difference in the training stability of this run.

So after 56 days of walking away on machine failures and restarting the training run, we finally ended up with this plot. This covers, I think, 53 or 54 restarts, and there are some missing because we didn't back up our checkpoints and logs properly. In the end, one thing to take note of is that the actual training corpus for OPT-175B was 180 billion tokens. That was around step 93K or so, when we epoched, and we still trained out to 300 billion tokens, but the second half of that was all repeating the same data we had seen in the first part of training. I also have some more slides covering OPT-66B; it's more or less the same story, so I can go through those or I can also just take questions right now. I think we still have some time, so I think it would be interesting to see those. There are a few questions, but I would love to hear the rest. Yeah, so, yeah.

So OPT-66B was funny in that this run was actually more unstable to train than the 175B model. One of the reasons was that we had half as many GPUs to train this model compared to 175B; this was all trained, I think, with 512 GPUs. Here it's the same thing: you see activation norms start spiking, the loss starts diverging with the orange run, and one thing we also see is that clipping starts kicking in quite frequently, mirroring the gradient norm explosion. One thing we started tracking since these runs is the cumulative clip frequency. You'll notice that in some runs there are these phase changes where you clip quite frequently and then have no clipping at all after some time. We didn't track that here, since at the time we weren't logging this metric, but you can clearly derive it from your clip frequency. So this is, I think, about 20K steps into the 66-billion training, and we would just see more or less the same thing we saw before: loss starts diverging, the loss scalar starts crashing, activations start spiking, and the main thing we would do to remediate these issues was to just lower the learning rate.
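
A tiny sketch of the cumulative clip-frequency metric mentioned here, i.e. the fraction of steps so far on which the gradient norm exceeded the clip threshold (an illustrative helper, not the OPT tooling):

```python
class ClipFrequencyTracker:
    """Track how often gradient clipping actually fires, plus the running
    cumulative clip frequency."""

    def __init__(self, clip_threshold: float):
        self.clip_threshold = clip_threshold
        self.steps = 0
        self.clipped_steps = 0

    def record(self, grad_norm: float) -> float:
        self.steps += 1
        if grad_norm > self.clip_threshold:
            self.clipped_steps += 1
        return self.clipped_steps / self.steps  # cumulative clip frequency so far

# usage: freq = tracker.record(grad_norm)  # log freq alongside the loss curve
```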

So in this case, to start, we did try lowering clipping first and keeping it at 0.3 to mirror the OPT-175B settings. I think since then we've found that it's not clear how much clipping actually helps; it seems to just delay instabilities as opposed to actually address them. But the goal here was to mirror OPT-175B as closely as possible, so 0.3 seemed like a reasonable threshold to drop down to. This one, I think, let me see.

Yeah, so at some point we actually ended up restarting the entire run. This is still noting the impact that just lowering clipping has: we see a significant difference in activation norms. And similarly here, this is when we were just dropping the learning rate.

The learning rate is the bottom-right plot, and you see that just changing the learning rate alone can significantly change the path of the loss, along with the impact on activation norms as well. So this is a series of going from the red line, which is the highest learning rate, rolling back a little bit further to do the whitish run, and then, that not being sufficient, lowering it again. You also see that how we choose where to restart from is another part of this equation. Sometimes, when we try to get greedy and not restart far back enough, that ends up just hurting us in the end, because at some point it starts exploding again, so we have to roll back quite significantly in some cases. Yeah, so that was, and this was the final actual training curve. Actually, this is not the final one.

This is the final training curve for the 66-billion run. The top right over here is the loss curve, same thing; we ended up missing some logs in the middle, but overall there were quite a few restarts, and the shape looks roughly the same as the 175-billion run. The activation norm is at the top left here, and the pattern we usually see is that it will actually go up to some point and then start coming down, kind of mirroring the learning rate decay. When things don't follow that pattern, it's usually when things go wrong. So, for whatever reason, the pattern of activation norms decaying over the course of training seems to be quite important. Then in the bottom right are the gradient norms, and you see that drift up from 0.2 all the way to 0.25. Usually that should be pretty stable, and this upward trajectory is actually kind of a bad sign. With bfloat16 training you actually see it stay quite stable at a roughly constant value, as opposed to this weird drift pattern, but for this run we saw that near the end of training. And then for loss scaling, going back to fp16 training, most times when things are stable the value is roughly around two to four, and when things are unstable, that's when we see it crash to zero. So near the end of this run, where the learning rate was low enough, things were going quite smoothly for the last, I guess, 30K steps, but in the beginning it was actually quite difficult to get this to run. So that's pretty much it. Cool, this was really, really interesting. I'm sure, I see we have a lot of questions. So, Susan, I'll go ahead and stop your screen share, and I actually see that Percy has his hand raised, so I'll let him ask the first question.

Right, cool talk. I really admire how you and Meta decided to make all the trials and tribulations transparent, which, as far as I know, no one else does, so it gives us a really inside view into how these models are built. I'm wondering if you could say a bit more about, if you had to rewind the clock and start today, what is the thing you would do, knowing what you know now? Ah, a lot more data, and spending a lot more time curating, processing, and cleaning it.

I think at this point, given this effort and similar results from other labs that have trained models at this scale, we kind of know the recipe, and it's not the architecture or the exact hyperparameters; those don't matter too much. It really comes down to the data. So had we trained with, say, 800 billion or 1.4 trillion tokens, I think we would have ended up with a much better model. Just a follow-up on that a bit. So clearly we've learned that data matters a lot. Suppose you had the larger dataset: would you not still see some of the same training instabilities that you suffered through, and also the hardware failures? Maybe there's not much you can do about those, but it seems like there are still some decisions to be made about what exactly the architecture is and what parameters you set. Yeah, so on that front, from our observations after these runs, using bfloat16 was actually quite important for improving stability. When loss would spike, if you're using bfloat16, you can actually continue training through it. Of course, eventually that can still become unstable, but that did help resolve a lot of the issues that the loss-scaling logic we had was just bandaging over.

The other bit is, I think, initialization is quite important too. There's been some work looking at how to initialize different layers a little differently; I believe the Tensor Programs work coming out of the Microsoft and OpenAI collaboration on zero-shot hyperparameter transfer is also important to take into consideration. Because I think the depth-versus-width aspect ratio isn't really clear when we scale up: how deep should these models be? We do know that the more layers you add, the more unstable the training run can become. So these are a few things that could still make a difference, but overall, I think the biggest difference really is bfloat16. All right, that's useful.

More data, bfloat16. Got it, that's the secret. So there are a lot of questions in the Discord today, so I'll try to organize them and get everybody's question answered. One of the early ones that was quite interesting, I think, is: you mentioned early on that your team size was only five people or something like that. Can you talk a little bit about why there were only five? Is it that somebody decided you only need five people to train a 175-billion-parameter model? Or, there have been these tweets going around about how there are only 200 people on Earth who can train these models; were you one of those people? Can you talk a little bit about that dynamic? I mean, I...

I mean, there were a lot of collaborators. Eventually we did end up bringing in more engineers to help stabilize the run, so it wasn't just five the whole time. But to start there were just a couple of us looking at cluster issues, and part of the goal there was also to de-risk that entire effort, which is a little bit separate from just training the model, I think.

In general, though, training itself is complicated when it comes to scaling, but it's still a small piece of the development lifecycle, going back to how important data is, and evaluation, and everything that comes afterwards. There were a lot of collaborators working on those efforts, but just for the core infra part we didn't really need too many people involved, especially if we're just training one run for many weeks or months. It would be kind of a waste of time for 20 people to just babysit one run. Gotcha. Speaking of that, actually, people are wondering how much of your day was spent looking at those loss curves. Quite a bit, I mean, yeah. Considering how much compute we were using, and the obvious pressure to succeed, there wasn't much else to do but look at the tea leaves of these curves wiggling up and down and hope we could catch instabilities before wasting hours and hours of compute.

I think at some point, though, most of our days were dominated by hardware debugging and infra issues: the mysterious CUDA errors, NCCL errors, or throughput drops. I kind of glossed over how problematic that was by just alluding to the fact that they existed, but really debugging them was a whole other level of pain, since sometimes it was not clear which machine had the issue. We had to develop a bunch of cluster checks to bisect the pool of machines we had and look for which ones might have manifested some problem, and the complexity of all that was quite challenging, since we were starting from a completely new cluster environment to do this. Gotcha. Another question I'm seeing come up quite a few times is: how did you decide on what specific changes to make? How did you decide, today we're going to increase the weight decay, tomorrow we're going to decrease the learning rate? How did you make all those decisions? Yeah, at some point we noticed that there weren't that many decisions we could make.

One is: roll back further and lower the learning rate. One is toggle clipping, one is toggle weight decay. And then the Adam beta2 change: we didn't really have the luxury of toggling that much throughout training, since we had an issue with how we reloaded that value and so on, so really we just had to commit to one value and go with it. The rest were all aimed at addressing gradient norm spikes. For the weight decay one, I think we've also seen in PaLM that they derive the weight decay as some function of the number of steps. So there are only so many parameters surrounding your optimizer to play around with, and those were the couple we decided to go with. Yeah, gotcha. Some people are asking if you have thoughts about trying out different optimizers.

So I know there's this one that came out recently, found through genetic programming or something, and then somebody linked to an optimizer called VeLO. Do you have thoughts about how you'd go about trying those, testing any of those out? Yeah, the unfortunate bit here is that usually we see a lot of promise at smaller scales, and you can try a bunch of different optimizers.

You can get wildly different results and get very excited, and you think this new optimizer is the best thing ever. Unfortunately, getting these things to actually improve over the Adam baseline is quite hard at scale. If there are ever additional hyperparameters that get added, it's not clear how those scale up; this goes back to the whole zero-shot hyperparameter transfer question. At this point we know how to deal with Adam in some ways. I'm sure there's probably still a better algorithm out there that we don't have right now, but there just aren't that many shots we have to test new optimizers, and they can change the training dynamics completely. The worst is if something happens late in training, where somehow loss stops going down; then you have to redo the whole thing. So there's quite a bit of, effectively, loss aversion, right? Yeah, or aversion to actually changing things too much at some point, since this stuff is so expensive to run. Can you give some color on that?

The risk aversion, or thinking about those risks: can you give some color on how you think about that? What would convince you to try a different architecture, or something completely different, and dedicate more resources, now that you've done the whole process with all the standard things? Yeah, I think the most important thing to look for is the, quote-unquote, scaling laws of how that new architecture or new algorithm behaves as you scale up.

If we see a consistent pattern where that holds up as you scale and the differences are still meaningful, then that's super promising and we'd want to test it out. Usually what I've seen, unfortunately, is that when you scale certain things up, the gains you see at smaller scale just disappear. So if there's that diminishing-returns aspect, then we have way less appetite to try it out, in case it adds more instability or the gains completely disappear. Let's see, is there kind of a magic point where those gains usually disappear, or is it method-dependent? We don't have that many data points to know this, but from the work I've seen coming out of Anthropic and Google...

Right now it seems like the 22-billion to 50-billion, or actually 22-billion to 70-billion, range is quite important, and the hope is that if things work at the 70-billion scale, they'll work at larger scales. We're still validating that ourselves right now too. So hopefully we get a much better data point than just the couple of scales that have been trained in the industry. I see. Is there also a point where instability starts becoming a much bigger issue? So I've trained very small models, like 125 million, 350 million, up to, we've personally trained up to, like 2.7 billion, and at those scales you can do almost anything and the model will converge. Did you notice, especially as you were scaling up the smaller models...

...is there a point where there's a phase change and things become much more unstable? Yeah, I think so. Right now we're doing a lot of ablations at around the 7-billion scale, and actually some initialization changes can be quite impactful at that scale, but it's still the case that in most instances, if you throw a few random hyperparameters at it, things will just still work. I haven't noticed any... I think this goes back to the gap between 70 billion and much larger models: it seems like there's still this unknown territory of when things become super unstable around that range. So I guess I don't have a really good answer other than: I think it's around 70 billion and above, but I'm not sure.

So you talked a little bit about all the hardware issues you were seeing. Can you talk about how you checked for those, and the infrastructure you put in place to protect against them and to figure out when it's time to go swap out a rack or debug something? Yeah, so this was just a lot of trial and error, and also working with the cloud provider to debug some of these problems. The one tool we use a lot is this library called gpu-burn; pretty much it just lights up every part of your GPU memory, and that can flag memory issues, so we use that one quite often. We also look at nvidia-smi, obviously, where we look at the number of uncorrectable and correctable errors, and there are also NVLink checks and so on.
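
A hedged sketch of the kind of per-node check described here, pulling ECC error counts out of nvidia-smi from Python. The query field name is an assumption that varies by driver version; `nvidia-smi --help-query-gpu` lists the fields your installation actually supports.

```python
import subprocess

def uncorrected_ecc_counts():
    """Return a {gpu_index: count_string} map of uncorrectable volatile ECC errors.
    Verify the query field name against `nvidia-smi --help-query-gpu` for your driver."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = {}
    for line in out.strip().splitlines():
        idx, count = [field.strip() for field in line.split(",")]
        counts[int(idx)] = count  # may be "[N/A]" if ECC is disabled on that GPU
    return counts

# A node reporting nonzero uncorrectable errors is a candidate to pull from the pool.
```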

So I think there are more and more diagnostic tools coming out as we understand the NVIDIA software stack a little better, but that's an ongoing thing, since with different driver versions and whatnot it also changes over time. But I'll say, with A100s the amount of failures we see is quite significant compared to V100s, and I'd probably expect that to increase with the next generation of hardware too. Interesting. I've heard some chatter that the H100s can be hard to get running as well, so we'll see; hopefully the next version of OPT won't be that much worse. I see Percy has his hand raised again, so go ahead, Percy. Yeah, I wanted to ask about data, since you didn't really talk about that. I know you mentioned that the dataset was basically whatever was lying around.

Could you say a little more about what was lying around and what you felt was missing from it? Was it just that you needed more tokens, or were there qualitatively different datasets you wish you had put in? Yeah, so we reused a significant chunk of the RoBERTa corpus, and the main thing added was refreshing some of the versions of the datasets used there.

We also looked at the Pile, and I think we took subsets of the Pile after double-checking; it's been a while, but there were some quality issues there that we noticed. I think the Pile by itself was purposely duplicated to include different epochs of different datasets, and so that added, for us, additional deduplication work we had to do to process this data. The other thing was adding Reddit, and that was the partnership with the BlenderBot folks. So I think, in general, it would have been nice to have a lot more books data; that's a high-quality corpus with quite a bit of knowledge embedded.

It would be nice to have more GitHub code data and arXiv as well. I think what we're seeing now, especially with, say, ChatGPT, is that these models are very knowledge-based; you want them, hopefully, to be factually correct, which random scrapes of the internet won't give you. So you really want to go for those high-quality, books-like datasets to improve that performance. Did you use the...

I know the Pile does have arXiv and GitHub and also some books corpora, so were those used, or do you mean in addition to that? Yeah, I think there are some issues with the formatting as well, with how you parse some of these datasets, and there are artifacts that may not be parsed correctly, so I'd have to double-check. I mean, we threw the corpus together very quickly, so I don't have the exact full list of the 20-plus sources memorized. But yeah. Okay, thanks. All right, there are a couple of questions on how you know when you've succeeded. One person just asked: how did you define success for this project? In a very naive way, which is: did loss go down? Yes. And did the final benchmark performance look reasonable-ish? So in that case, it felt like a success. Oh, we've also since noticed that in the way we were evaluating the model, we didn't do enough prompt tuning on the evals, so that would have also helped quite a bit.

But yeah, I think it's tough, because for us the most important thing was the open-source component, and having a reasonable model at large scale to support the work on inference efficiency, not just throwing a bunch of random weights over the wall with zero performance to measure. Knowing that, with all this inference-efficiency work, you can make all these numerical changes and whatnot, massive pruning, sparsity, quantization and so on, the end result is to measure those changes with some basic NLP benchmarks to make sure the model wasn't completely lobotomized. So I think that bit was probably more important than, say, having the state-of-the-art model on everything, just because inference-efficiency work can happen independently of that. I see. And how much did you, while you were training, compare performance against the other major models that were out? I think at the time GPT-3 was out; I don't actually remember what other models were out at the time. But did you do comparisons across models like that during training as well? Honestly, no, but we should have. Around early December was when we realized we really needed to be benchmarking these models, and we didn't really have folks focused on that, so that was a whole separate effort, bringing up the NLP benchmarks to do this at scale, which is also quite challenging.

So now we do; right now, for all of our major runs, we have benchmarks that run periodically on intermediate checkpoints. At the time, given the new cluster environment and everything, we weren't doing those checks; the main thing we needed was just to have loss go down. As long as loss was going down and it looked better than the loss curves for smaller models, that was the sanity check we could rely on, basically. I see. I see in the Discord that there's a follow-up to the data conversation with Percy earlier. Can you talk a little bit about what the state-of-the-art, best-in-practice techniques are for processing, selecting, and curating data nowadays? Oh yeah, so this one's tough, because there's a lot of work going into data pruning and efficiency, around making sure we're selecting the data we train on very carefully, since it's clear that we don't need the entire internet blindly scraped and dumped into these models. At the end of the day, there isn't a really good way to validate other than training models on new datasets and doing ablations of that sort, which are also quite expensive if you want to do them at a significant scale.

So I think this is why procuring the knowledge, the high-quality corpus, is the most important thing, because we can have some degree of confidence that having more books is probably helpful, having more papers in there is probably helpful. But trying to say, how do I cull the most important bits of Common Crawl or Reddit out? Honestly, that's still a bit of an open question, and that will make or break a lot of these models too. If that data's average quality is not good enough, you'll see it reflected in the model, and right now, unfortunately, without probing these models you can't really tell. Maybe a follow-up to that is: how do you define data quality?

I think it really comes down to the model you end up training, which reflects the composition of data that you use, and if you don't change that mixture significantly, you can't really tell. We did a bunch of ablations at the 70-billion scale with different versions of Common Crawl, and we didn't notice significant differences, because the techniques we were employing didn't show a strong enough differentiating signal.

So I think it's much easier if you're focused on training a model for a specific task, say code completion: I can immediately see whether the model can do code generation as well as before if I change the dataset in certain ways. But if it's this general-purpose foundational language model, very vaguely defined, we don't really have a good way of saying, hey, this is how data quality can improve the model. Now, ideally, it would be a good sign if you could train on, say, only half of the corpus and reach the same performance; in that case, maybe that half is higher quality than the entire dataset. So that could be one way of doing relative measurements. But honestly, at this point we don't really have a good metric to follow. Yeah, it is a tricky question.

I guess you mentioned books as being sort of high quality, because people actually spend a lot of time making good books. Code, I guess, is another thing people have found to be really helpful for reasoning capabilities, and you see that with the Codex models. I suppose also the entropy of your data matters: if you have data which is basically HTML templates with very little information content, you'd waste a lot of time just going through tokens, which relates also to deduplication. But yeah, in general, it would be nice if we had a more principled way of doing data curation. There are a couple of questions on YouTube that I wanted to go to. One question is: do you have any intuition about why, so I remember early in the talk you said you tried some higher learning rates and then eventually went back to the thing in Megatron and in the GPT-3 paper. Do you have any intuition about why those higher learning rates maybe didn't work as well or were less stable? I think so. I mean, I think the biggest data point is that what we know so far is that lowering the learning rate helps improve stability.

There were many cases in which changing that one parameter alone and restarting would change the learning dynamics, or training dynamics, entirely. What we've always seen so far is that if the learning rate is excessively high, the training run just never converges. So I think there is this sweet spot, and I believe there was, from OpenAI, the gradient noise scale paper, where if you look at the gradient norms you can sort of back out the optimal batch size. I think batch size is highly linked to the learning rate that's used, so these are all coupled in some ways. Honestly, I think this is just a huge trial-and-error, guess-and-check process, and if at some point you change one thing and things become unstable, you go, okay, that's the wrong direction. So a lot of the time we actually do go for the directional signal too. A lot of when we do the warmup phase, like how long a warmup to go for, how long a learning-rate schedule to go for, you end up making really drastic changes to see if the direction is correct, and then you can tune the exact value afterwards. Right, that's really interesting and I think really helpful for folks listening. There's one more question on YouTube. When you see bugs or problems in training, whether it's some instability or some hardware failure, are you able to reproduce those bugs in some other environment? I imagine there are so many machines and so many things that can go wrong; how do you really drill down to something in particular? Yeah, reproducing. So this goes back to why determinism across restarts is so important: you really want to be able to reproduce the exact numerical path taken to a certain unstable point.

But I think full-on bitwise determinism is nearly impossible to guarantee, and furthermore, for hardware failures, in some cases we have a lot of transient hardware failures too, which are nearly impossible to reproduce consistently. That's what makes these failures hard to deal with: we don't have that many remediation steps other than restart, restart repeatedly, remove some machines, see if things become more stable, et cetera. You become a little bit more clever about which machines you remove if it's hardware-related; you become a little bit more clever about, say, what learning rate you use if it's instability-related. But there aren't that many methods of addressing things, so that does constrain what we can do. I think this also comes from pure experience: in the case of hardware issues, I didn't really know that much about what the implications of link flaps are, or what the mechanisms for detecting Xid errors were. These are all things that come through trial and error and through slowly debugging them over time. I'm sure with H100s, or the next generation of hardware we'll be using, say TPUs in the future, there's an entirely different stack to learn. So, no good answer here other than pure experience and doing things repeatedly, and hoping that as you do, you gain enough experience to do it faster next time. Right, one more question from YouTube. First they say thank you for the talk, which is of course a sentiment that we echo. The question is: when you speak about fp16 or bfloat16 training, do you keep fp32 weights around, or is fp32 just for the Adam moments? What's the exact breakdown? Yeah, so at this point we don't really use mixed precision...

...mixed precision in the sense of keeping a copy of fp32 weights around, just because the implications for memory requirements are too high. But as you keep scaling, I do believe that in some cases bfloat16 with fp32 weights is a bit more stable. I think this is one of those things where, depending on what scale you're iterating at, you can remove that or add it in, depending on how much memory you have at your disposal and how fast you want training to go. Okay, sorry, what was the original question, other than keeping the weights? They also mentioned the Adam moments.

Yeah, yeah, so the Adam moments. I think because of FSDP sharding, which splits the optimizer states across all GPUs, there's minimal cost to keeping them in higher precision, so you might as well. So I think it should go like this: if you have memory to spare, use higher precision always. If you don't, then the first thing to go is the duplicate copy of parameters; the next to go would maybe be the actual Adam states, but I think that's really rare unless you're training with very few GPUs. Gotcha. One other question that came up at some point: obviously, when you're training at these sizes, you have to do things like model parallelism, tensor parallelism, pipeline parallelism, et cetera. At that size, what is your effective batch size on a single GPU? Are you still in a kind of batched environment at all, or are you at batch size one? Can you give some color on those training workloads? Yeah, usually the local batch size is around 8 to 32. I think this is just the balance of wanting to split your memory consumption roughly equally between your model weights and your batch, so the shapes line up well. Batch size one is usually when we have to go to tensor parallelism so that the effective batch size per GPU gets increased. Gotcha. Another question from the class: do you have any intuition for why the newer hardware generations of GPUs seem to be more unstable? Any idea what's causing it, or what's going on? Yeah, I think you'd have to ask an NVIDIA insider expert about why that is.
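
For context on the sharding point, here is a minimal sketch using PyTorch's FSDP API (assuming the process group is already initialized); it illustrates the general pattern of sharded state with a mixed-precision policy, not the actual stack used to train OPT.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_model(model: torch.nn.Module) -> FSDP:
    """Shard parameters, gradients, and optimizer state across data-parallel ranks,
    computing in low precision while keeping gradient reduction in fp32."""
    policy = MixedPrecision(
        param_dtype=torch.bfloat16,   # low-precision compute
        reduce_dtype=torch.float32,   # higher-precision gradient reduction
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model, mixed_precision=policy)

# The optimizer (e.g. torch.optim.Adam) is built on the wrapped module's parameters;
# since each rank only holds a shard, fp32 moments add relatively little per-GPU memory.
```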

I mean, from my perspective, it would be nice if most of their code was open source so we could actually help debug this, but since it's not, it's very opaque, and when we do see failures, in some cases we're seeing Xid errors that are not even documented unless you poke around at some hints in code they've released.

So I think, overall, complexity just starts increasing over time. If it's closed source, there's only so much we can do, and as these machines pack more things in, I'm sure there's a lot more instability, and the software around it has also gotten more complex. So I think it's just, over time, the growth of software and the maintenance of it; I'm sure there's a lot of cost there. Great. Percy, go ahead.

I wanted to ask a question that's more, I guess, looking into the future. You've trained OPT and you're working on other models, and meanwhile there are a few players in this space training models, including, of course, OpenAI, but also Cohere, AI21 and so on. So I'm wondering, if you had to state what your local mission is, what is the thing that makes what Meta is doing, or your team more specifically, unique? Is it the open source and transparency, or is it something else? Yeah, I mean, for me personally, I think it's really important to be open-sourcing all this knowledge and these techniques.

I think we're seeing that, even though there's a lot of excitement in the industry and a lot of players training these models, the ongoing experimentation is just becoming more and more closed over time, and more capital- and resource-intensive overall. So for me, especially if we assume this technology is going to impact the world in some way and that it's going to be developed, unfortunately, by just a select few, it worries me a lot that this knowledge gets isolated and concentrated, with very, very few people dictating how it's going to be used in the future. So for me, opening this science up and making sure we can develop this technology together is pretty important.

Hopefully Meta continues to fund this. I think for all these labs that's still an open question: how far do we push open science here, and what are the risks of doing so? I think there are also still non-zero risks of being too open. So it's just very hard to walk that line of how much to bring the world along in this development cycle, and how to do so most effectively. Right, I think that's a great call to end on.

I totally agree with you. I hope things stay open, and I'm sure anytime Percy would love to talk about training models together. Yeah. So that brings it to a wrap for this week. Next week, I believe, is the last week of our set of talks for this quarter, and we've been very excited by all the great talks we've had so far. Thanks again to Susan for coming today; it was a really interesting talk, really interesting to hear about all these trials and tribulations and learn so much from you. Of course, thank you also to everybody watching on YouTube and in the class, and thanks for the great questions. Next week we've got Yejin Choi from UW, who I believe is now a certified genius, and Jared Kaplan from Anthropic, so we're very excited to have them. And with that, we will wave goodbye and say goodbye to YouTube.
