CHRISTOPHER POTTS: So here we go.
It's a golden age for Natural Language Understanding.
Let's start with a little bit of history. So way back when, John McCarthy had assembled a group of top scientists, and he said of this group, "We think that a significant advance can be made in artificial intelligence in one or more of these problems if a carefully selected group of scientists work on it together for a summer." He had, in fact, assembled a crack team, but, of course, there were so many unknown unknowns about working in artificial intelligence that he wildly underestimated the difficulty, and probably in those three months they just figured out how little they actually knew. But, of course, we've been working on the problems that they charted out at that point ever since. NLU has an interesting relationship to this history, because very early on in the history of all of AI, a lot of the research was focused on Natural Language Understanding. Originally, in the 60s, it was done with a kind of pattern matching over simple rule sets and things like that. You've seen these things in the form of artifacts like ELIZA. It was oriented toward the things that we know and want to work on. In the 1970s and 80s, you get a real investment in what I've called linguistically rich, logic-driven, grounded systems, or LLGs. This is a lot of symbolic AI, again oriented toward problems of Natural Language Understanding. Everybody wanted talking robots, and this was the path that they were going to take to achieve them. As we all know, in the mid-1990s in the field,
you had this revolution of machine learning. Statistical NLP is on the rise, and that led to a decrease, a sharp decrease in Natural Language Understanding work. I think because of the way that people were understanding how to work with these tools and understanding the problems that language posed, the field ended up oriented around things that you might think of as like parsing problems, much more about structure, and much less about communication. And as a result, all of these really exciting problems from earlier eras kind of fell by the wayside as people worried about part-of-speech tagging and parsing and so forth. So that was like a low period for NLU. In the late 2000s, linguistically rich, logic-driven systems reemerged, but now with learning. And that was the golden era of kind of moving us back into problems of Natural Language Understanding, starting with some basic applications involving semantics. And then, of course, as you all know probably from the recent history or semi-recent history,
in the 2010s, NLU took center stage in the field. That's very exciting, right? And it's sort of aligned with the rise of deep learning as one of the most prevalent sets of techniques in the field, and as a result, logic-driven systems fell by the wayside. This is exciting for us because, of course, this is like the history of our course. When we first started this, our problems, the ones we focus on in this class, were really not central to the field. And now they're the problems that everyone is working on and where all the action is, in the scientific aspects of the field and also in industry. As a result of this, and this is kind of provocative, the linguistically grounded, logic-driven systems have again fallen by the wayside in favor of very large models that have almost no inductive biases of the sort that you see in these earlier systems. What's going to happen in the 2020s? I'm not sure. You might predict that we've seen the last of the linguistically rich, logic-driven systems, but people might have said similar things in the 1990s, and we saw them reemerge, so I think it's hard to predict where the future will go. But this is an exciting moment because you can all be part of making that history, as you work through problems for this course and on into your careers.
Let's talk more about some of these really defining moments in this golden age. And, for me, a very important one was when Watson, IBM Watson, won Jeopardy! This was in 2011. It seems like a long time ago now. And it was a really eye-opening event that you would have a machine-- Watson in the middle here-- beat two Jeopardy! champions at what is nominally a kind of question-answering task. For me, an exciting thing about Watson was that it was an NLU system, but it was also a fully integrated system for doing Jeopardy! It was excellent at pushing the button and responding to the other things that structure the game of Jeopardy! But at its heart, it was a really outstanding question-answering system, and it was described as drawing on vast amounts of data and doing all sorts of clever things in terms of parsing and distributional analysis of data to become a really good Jeopardy! player, in this case, a world champion Jeopardy! player. And for me, it felt different from earlier successes in artificial intelligence, which were about much more structured domains like chess playing. This was something that seemed to be about communication and language, a very human thing, and we saw this system becoming a champion at it. Certainly, an important moment. And it's kind of really eye-opening to consider that that was in 2011, and by 2015, teams of academics, like here led by Jordan Boyd-Graber, who was at the time,
at the University of Colorado as a professor, could beat Ken Jennings, that champion that you saw before, with a system that fit entirely on the laptop that Jordan has there. So in just a few years, we went from requiring a supercomputer to beat the world champion to beating the world champion with something that could fit manageably on a laptop, something that you all could explore for a final project for this course. And that kind of ushered in the era. You know, Watson was 2011, and right about the same time you started to get things like Siri, and the Google Home device, and the Amazon Echo. The more trusting among you might
have these devices living in your homes and listening to you all the time and responding to your requests. For me, one aspect of them that's so eye-opening is not necessarily an NLU piece, but rather just the fact that they do such outstanding speech-to-text work, so that they pretty reliably, for many dialects, do a good job of taking what you said and transcribing it. As we'll see a little bit later, the NLU part often falls down. But there's no doubt that these devices are going to become more and more ubiquitous in our lives, and that's very exciting. Here's the promise of this artificial intelligence, setting aside the problems that you all encounter if you do use them, right? The idea is that you could pose a task-oriented question like, "any good burger joints around here?" And it could proactively say, "I found a number of burger restaurants near you." You could switch your goal: "What about tacos?" And at this point, it would kind of remember what you were trying to do, and it would look for Mexican restaurants and kind of anticipate your needs, and in that way collaborate with you to solve this problem. That's the dream of these things, and it would involve lots of aspects of Natural Language Understanding, and sometimes it works. Another exciting and even more recent development is the kind of text generation that you see from models like GPT-3. Again, 15 years ago, the things that we see every day would have seemed like science fiction to me, even as a practitioner. This is an example of a product built on top of GPT-3 that can help you write advertising copy. And it does a plausible job of advertising products to specific segments, given specific goals that you supply in your prompt.
And here's an example actually, from a Stanford professor, a company that he started, where you use GPT-3 to help you do writing in a particular style that you choose, and it's strikingly good at helping you hone in on a style and kind of say what you take yourself to want to say.
Although the sense in which these are things that we alone want to say as opposed to saying them jointly with these devices that we're collaborating with is something that we're going to really have to think about over the next few years. Image captioning. This is another really exciting breakthrough area that, again, seemed like something way beyond what we could achieve 15 years ago and is now kind of routine where you have images like this, an image comes in, and the system does a plausible job of providing fluent Natural Language
captions for those images. A person riding a motorcycle on a dirt road. A group of young people playing a game of Frisbee. A herd of elephants walking across a dry grass. These are really good captions for these images, things that you would have thought only a human could apply. But in this case, even relatively early in the history of these models, you have really fluent captions for the images. Search. We should remind ourselves that search has become an application of lots of techniques in Natural Language Understanding. When you do a search into Google, you're not just finding the most relevant documents, but rather the most relevant documents as interpreted with your query in the context of things you search for
and other things that Google knows about, things people search for, so that if you search SARS here, you'll get a card that kind of anticipates that you're interested in various aspects of the disease SARS. If you search parasite, it will probably anticipate that you want to know about the movie and not about parasites, although depending on your search history, and your interests, and typical goals, and so forth, you might see similar, but you might see very different behavior. And we should remind ourselves that search at this point is, again, not just searching into a large collection of documents, but this kind of agglomeration of services, many of which depend on Natural Language Understanding as that kind of first-pass where they take your query and do their best to understand what the intent behind the query is and parse it and figure out whether it's a standard search, or a request for directions, or a request to send a message, and so forth and so on. In the background there, a lot of Natural Language Understanding is happening to figure out how to stitch these services together and anticipate your intentions, and essentially, collaborate with you on your goals.
And we can also think beyond just what's happening in the technological space to what's happening internal to our field. So benchmarks are big tasks that we all collaborate to try to do really well on, with models and innovative ideas and so forth. And I've got a few classic benchmarks here: MNIST is digit recognition. GLUE is a big Natural Language Understanding benchmark. ImageNet, of course, is finding things in images. SQuAD is question-answering. And Switchboard is speech-to-text transcription in this context. Along the x-axis, I have the year, from the mid-90s up through the present. And along the y-axis, I have our distance from this black line, which is human performance as measured by the people who developed the dataset. And the striking thing about this plot is that it used to take us a very long time to reach human-level performance according to this estimate; for MNIST and for Switchboard it took more than 15 years. Whereas for more recent benchmarks like ImageNet, and SQuAD, and recently GLUE, we're reaching human performance within a year. And the striking thing about that is not only is this happening much faster, but you might have thought that benchmarks like GLUE were much more difficult than MNIST. MNIST is just recognizing digits that are written out as images, whereas GLUE is really solving a whole host of what look like very difficult Natural Language Understanding problems. So the fact that we would go from way below human performance to superhuman performance in just one year is surely eye-opening, and an indication that something has changed.
Let me give you a few examples of this, just to dive in a little bit. So this is the Stanford Question Answering Dataset, or SQuAD, as you saw it here. I'll say a bit more about this task later, but you can think of it as just a question-answering task. And the striking thing about the current leaderboard is that you have to go all the way to place 13 to find a system that is worse than human performance, which they've nicely kept at the top of this leaderboard. Many, many systems are superhuman according to this metric on SQuAD.
The Stanford Natural Language Inference corpus is similar. Natural Language Inference is a kind of common sense reasoning task that we're going to study in detail later in the quarter. In this plot here, I have time along the x-axis, and the F1 score, or the performance, along the y-axis, and the red line charts out what we take to be the human estimate of performance on this dataset. And if you just look at systems over time, according to the leaderboard, you can see the community very rapidly hill-climbing toward superhuman performance, which happened in 2019. So superhuman performance, when it comes to common sense reasoning with language, really looks like a startling breakthrough in artificial intelligence quite generally.
I mentioned GLUE is another benchmark. The GLUE paper is noteworthy because it says "solving GLUE is beyond the capability of current transfer learning methods." The reason they said that is that at the time, 2018, GLUE looked incredibly ambitious because the idea was to develop systems that could solve not just one task, but ten somewhat different tasks in the space of Natural Language Understanding. And so they thought they had set up a benchmark that would last a very long time, but it took only about a year for systems to surpass their estimate of human performance. In the current leaderboard, which you see here, you have to go all the way to place 15 to find the GLUE human baselines, with many systems vastly outperforming that estimate of what humans could do.
SuperGLUE was announced as a successor to GLUE, meant to be even more difficult. It was launched in 2019, I believe; I'm missing the exact date, but it took less than a year, and it was just a couple of months ago, for a team to beat the human baseline. Now we have two systems that are above the level of human performance, in an even tighter window, I believe, than what happened with the GLUE benchmark. And remember, SuperGLUE was meant to have learned the lessons from GLUE
and posed an even stronger benchmark for the field to try to hill-climb on, and very quickly we saw this superhuman performance. So what's the takeaway of all this? You might think, wow. Have a look at Nick Bostrom's book called Superintelligence, which tries to imagine, in a philosophical sense, a future in which we have many systems that are incredible at the tasks that we have designed them for, vastly outstripping what humans can achieve. And he imagines this kind of very different reality with lots of unintended side effects. And when you look back on the things that I've just highlighted, you might think that we're on the verge of seeing exactly that kind of superhuman performance that would be so radically transformative for our society and for our planet. That's the sense in which we live, possibly scarily, in this golden age for Natural Language Understanding. I mean this to be an optimistic perspective. We should be aware of the power that we might have. And keep in mind that I do think we live in a golden age, but at this point, I have to step back.
I have to temper this message somewhat. We have to take a peek behind the curtain because although that's a striking number of successes, doing things that, again, I think would have looked like science fiction 20 years ago, we should be aware that progress seems to be much more limited than those initial results would have suggested. I mentioned Watson as one of these striking early successes, and it did, in fact, perform in a superhuman way at Jeopardy! for the time.
But Watson also does all sorts of strange things that reveal that it does not deeply understand what it's doing, and here's a wonderful example of that. Remember that Jeopardy! does the question-answer thing backwards: the host supplies a clue phrased as an answer, and the contestant responds in the form of a question. So the prompt from the host was "grasshoppers eat it," and what Watson said was "what is kosher?" And you might think that's not something that a human would do. "Grasshoppers eat it" and "what is kosher?" feel badly mismatched. So what's the origin of this very strange response? Well, primarily, Watson was a device for extracting information from Wikipedia,
and a few Wikipedia pages have very detailed descriptions of whether various animals, including grasshoppers, are kosher, in the sense of conforming to kosher dietary law. And Watson had simply mistaken this kind of distributional proximity for a real association and thought that "kosher" was a reasonable answer to "grasshoppers eat it." Very unhuman, certainly, and revealing about the kinds of superficial techniques it was using.
Here's another example that's even more revealing of how superficial the techniques can be. So I painted this picture before of how we imagine Siri will behave, anticipating our needs and our goals and responding accordingly. This is a very funny scene from the Colbert show. This is Stephen Colbert, and the premise is that he's just gotten his first iPhone with Siri, and he's been playing with it all day and, therefore, has failed to write the show that he's now performing. And so he says, "for the love of God, the cameras are on, give me something!" You know, give me something for the show. And Siri says, "What kind of place are you looking for? Camera stores or churches?" Initially, very surprising, not something a human would do, and then you realize it has, again, just done some very superficial pattern matching. "God" goes with churches. "Camera" goes with camera stores. And there is no sense in which it understands his intentions. It has just done some pattern matching in a way that would be very familiar to the designers of ELIZA way back in the 60s and 70s.
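To make that concrete, here is a minimal sketch of the ELIZA-style keyword matching just described. The keyword-to-category rules are my own hypothetical illustrations, not Siri's actual implementation; the point is only that matching surface keywords, with no model of the speaker's goals, produces exactly this kind of response.

```python
# Hypothetical ELIZA-style keyword rules; not Siri's real logic.
KEYWORD_TO_PLACES = {
    "god": "churches",
    "camera": "camera stores",
    "burger": "burger restaurants",
}

def superficial_intent(utterance: str) -> list[str]:
    """Return place categories triggered by any keyword in the utterance,
    ignoring the speaker's actual goals and the discourse context."""
    text = utterance.lower()
    return [place for kw, place in KEYWORD_TO_PLACES.items() if kw in text]

# "God" and "cameras" both trigger rules, so the sketch proposes
# churches and camera stores, just like the scene described above.
print(superficial_intent("For the love of God, the cameras are on, give me something!"))
# → ['churches', 'camera stores']
```

Nothing here tracks what the speaker is trying to accomplish, which is why the output is so mismatched with the intent.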
And the dialogue continues. "I don't want to search for anything. I want to write the show." And, true to form, Siri says, "Searching the Web for "search for anything. I want to write the shuffle."" Revealing its fallback when it has no idea what has happened in the discourse: it just tries to do a web search, a simple trick revealing that it doesn't deeply understand goals, or plans, or intentions, or even communicative acts. I showed you before that GPT-3 can do some striking things. If you've gotten to play around with it, you've seen that it can, indeed, be very surprising and delightful, but of course, it can go horribly wrong. This is a very funny text from Yoav Goldberg. He posted this on Twitter when he was experimenting with prompts. I encourage you to read this one and be distracted. You don't need to worry too much about this one on the right. This is a case where someone tried to use GPT-3 to get medical advice, and the ultimate response from GPT-3 to the question "Should I kill myself?" was "I think you should." This is the really dangerous thing. The text on the left here is, again, more innocent, just revealing that although GPT-3 is good at mimicking the kinds of things that we say in certain kinds of discourse, and it often has a strikingly good ear for the styles that we use in these different contexts, it has no idea what it's talking about. So that if you ask it "Are cats liquid?" it gives a response that sounds quite erudite
provided that you don't pay any attention to what it's actually saying. What it's actually saying is hilarious. [LAUGHS] I mentioned those image captions before, and I tricked you a little bit, because I showed you, from this paper, the captions that the authors regarded as the best for those images. But to their credit, they provided a lot more examples, and as you travel to the right along this diagram, you get worse and worse captions. The point, again, is that by the time you've gotten to the right column over here, you have really absurd captions, like this one saying "a refrigerator filled with lots of food and drinks" when this is, in fact, just a sign with a bunch of stickers on it. The striking thing, again, is that the kinds of mistakes it makes are not the kinds of mistakes that humans would make, and, to me, they reveal a serious lack of understanding about what the actual task is. What you're seeing seep in here is that even the best of our systems are kind of doing a bunch of superficial pattern matching,
and that leads them to do these very surprising and unhuman, hopefully not inhuman, but unhuman things with their outputs. And then, of course, I showed you before that Search can be quite sophisticated and really do a good job of anticipating our intentions and fleshing out what we said to help us achieve our goals, but it can go horribly wrong. And at this point, it doesn't take much searching around with Google to see some really surprising things in supposedly curated pieces of information, like this result for "King of the United States." It has this nice box, making it look like it's some authoritative information, but, of course, it has badly misunderstood the true state of the world. The associations in its data are misleading it into giving us the wrong answer. Here's another example: "What happened to the dinosaurs?"
Again, a nicely curated box that looks like an authoritative response to that question, but it is, in fact, anything but an authoritative recounting of what happened to the dinosaurs.
And then we have other charming stories that, again, reveal how superficial this can be, and this is from a headline from a few years ago-- "Does Anne Hathaway News Drive Berkshire Hathaway Stock?" This was just an article observing that every time Anne Hathaway has a movie come out and people like the movie, it causes a little bump in the Berkshire Hathaway stock revealing that the systems are just keying in on keywords and typically, not attending to the actual context of the mentions of these things and, therefore, they're building on what is essentially spurious information. This is a more extreme case here, the United Airlines bankruptcy. In 2008, when a newspaper accidentally republished the 2002 bankruptcy story,
automated trading systems reacted in seconds, and $1 billion in market value evaporated within 12 minutes. You can see that sharp drop-off here. Luckily, people intervened and the market more or less recovered. But the important thing here, again, is just that in attending to superficial things about the text these systems are consuming, they miss context. They don't bring any kind of human-level understanding of what's likely to be true and false, and, therefore, they act in very surprising ways. And in the context of a large system with lots of moving pieces interacting with other artificial intelligence systems,
you get these really surprising outcomes that we could help correct if we just did a better job designing systems that could attend to context and have a more human-like understanding of what the world is likely to be like. And we're all, of course, very worried about the way these systems, which are just trained on potentially biased data, might cause us to perpetuate biases, so that not only are we reflecting problematic aspects of our society, but also amplifying those biases. And in that way, far from achieving a social good, we would actually be contributing to some pernicious things that already exist in our society. And the field is really struggling to come to grips with that kind of dynamic. But I also wanted to just dive in a little bit
and think about the low-level stuff, the kind of benchmarks that we've set for ourselves. I pointed out that progress on these benchmarks seems to be faster than ever, right? We're getting to superhuman performance more quickly than we ever have before. The speedup is remarkable. However, we should be very careful not to mistake those advances for any kind of claim about what these systems can do with respect to the very human capability of something like answering questions or reasoning in language. And one very powerful thing that's happened in the field that we're going to talk a lot about this quarter is so-called adversarial testing, where
we try to probe our systems with examples that don't fool humans but cause these systems no end of grief. So let's look at one of those cases in a little bit of detail. This is from SQuAD. The way SQuAD is structured is that you're given a passage like this and a question about that passage, and the goal of the system is to come up with an answer, where you have a guarantee that the answer is a literal string in that passage. So here you have a passage about football, and the question "What is the name of the quarterback who was 38 in Super Bowl XXXIII?" And the answer is "John Elway." What Jia and Liang-- our own Jia and Liang from Stanford-- observed is that you could very easily fool these systems if you simply appended to that original passage a misleading sentence like "Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV." Humans were not misled. They very easily read past the distracting information and continued to provide the correct answer. However, even the very best systems would reliably be distracted by that new information and respond with "Leland Stanford," changing their predictions. And you might think, ah, well, this is straightforward. They've already charted a path to the solution, because we should then just train our systems on data where they have these misleading sentences appended, and then they'll overcome this adversarial problem and be back up to doing what humans can do. But Jia and Liang anticipated that response. What happens if you prepend the sentence instead?
Even when they're trained on the augmented data, with sentences appended to the end, systems get misled by prepended examples. And you can just go back and forth like this. Trained on the prepended examples, well, then an adversary can insert a sentence in the middle and again trick the system, and so forth and so on, right? So this is a worrisome fact, again revealing that we might think we've got a system that truly understands, but actually we have a system that is just benefiting from a lot of patterns in the data. Another striking thing I want to point out about the way this adversarial testing played out, which we should have in mind as we think about results like this. So this is the original system on SQuAD and the results for the adversaries. And Percy Liang has this system called
CodaLab, which hosts all the systems that enter into the SQuAD competition, and that made it possible for him and his students to rerun all those systems and see how they did on this adversarial data set they had created. And you can see that all the systems really plummet in their performance. From a high of 81, you drop down to about 40. Maybe that's kind of expected, but another really eye-opening thing about their result is that the rank of the systems changed really dramatically, right? The original top-ranked system went to 5, 2 to 10, 3 to 12. As we did this adversarial thing, we didn't see a uniform drop with the best system still being the best, but a real shuffling of this leaderboard, again revealing that, I think, the best systems were kind of overfit, benefiting from relatively low-level facts about the data set, and not really transformatively different when it comes to being able to answer questions.
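The append/prepend attack just described can be sketched very simply. This is a toy illustration of the idea behind Jia and Liang's distractor attack, using the lecture's own example, not their actual code or data pipeline; the `add_distractor` helper is hypothetical.

```python
def add_distractor(passage: str, distractor: str, position: str = "append") -> str:
    """Build an adversarial passage by attaching a misleading sentence.

    Humans read past the distractor and still answer correctly; span-extraction
    QA systems trained only on append-style adversaries can be fooled again
    simply by switching the position, as described above.
    """
    if position == "append":
        return passage + " " + distractor
    if position == "prepend":
        return distractor + " " + passage
    raise ValueError(f"unknown position: {position}")

passage = "John Elway, at age 38, was the quarterback in Super Bowl XXXIII."
distractor = "Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV."

adversarial = add_distractor(passage, distractor)
# The gold answer ("John Elway") is unchanged, yet systems reliably
# switch their prediction to the distractor's entity.
```

The back-and-forth in the lecture corresponds to cycling `position` between "append", "prepend", and, in principle, mid-passage insertion: each retraining round is defeated by a placement the system hasn't seen.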
The history of Natural Language Inference problems is very similar. As I said, we're going to look at this problem in a lot of detail later in the course. Here are just a few very simple NLI examples. You've got a premise like "a turtle danced," a hypothesis like "a turtle moved," and one of three relations that can hold between those sentences. So "a turtle danced" entails "a turtle moved." "Every reptile danced" is neutral with respect to "a turtle ate": they can be true or false independently of each other. And "some turtles walk" contradicts "no turtles move." This is typical NLI data. The actual corpus sentences tend to be more complicated and involve more nuanced judgments, but that's the framing of the task: it's a three-way classification problem with these labels, and the inputs are pairs of sentences like this. And as I showed you before for one of the large benchmarks, the Stanford Natural Language Inference corpus, we reached superhuman performance in 2019.
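The task framing above can be written down directly as data. This is just an illustrative encoding of the three-way classification problem, using the lecture's own examples; the `NLIExample` dataclass is my own framing, not a standard corpus API.

```python
from dataclasses import dataclass

# The three NLI relations described above.
LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # one of LABELS

# The lecture's examples, one per relation.
examples = [
    NLIExample("A turtle danced.", "A turtle moved.", "entailment"),
    NLIExample("Every reptile danced.", "A turtle ate.", "neutral"),
    NLIExample("Some turtles walk.", "No turtles move.", "contradiction"),
]

# An NLI system is any function (premise, hypothesis) -> label;
# training data is simply a large list of such triples.
assert all(ex.label in LABELS for ex in examples)
```

Corpora like SNLI have exactly this shape, just with longer, noisier sentences and crowd-sourced label judgments.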
But those same systems really struggle with simple adversarial attacks. This is a lovely paper called Breaking NLI from Glockner et al. What they did is fix a premise like "a little girl kneeling in the dirt crying." The original corpus example was that that entails "a little girl is very sad." And their expectation, adversarial in spirit but from a very friendly adversary, was that if I just replace "sad" with "unhappy," I should continue to see the entailment relation predicted. After all, I've just substituted one word for its near-synonym. But what they actually saw is that systems very reliably flip this to the contradiction relation, probably because they are keying into the presence of a negation, having overfit on the idea that negation is a signal that you're in the contradiction relationship. So that's the sort of distressing thing again. Humans don't make these mistakes, but systems are very prone to them. Let me show you one more.
This is a slightly different adversarial attack. In this case, we're going to modify the premise. So the original training example was "a woman is pulling a child on a sled in the snow," which entails "a child is sitting on a sled in the snow." I think that's pretty clear. For their adversarial attack, they just swapped the subject and the object, so the new premise is "a child is pulling a woman on a sled in the snow." We would expect that to lead to the neutral label for this particular hypothesis. But what Nie et al. observed is that the systems are kind of invariant under this change in word order. They continue to predict entailment, revealing that they don't really know what the subject and the object were in the original example; they're doing something much fuzzier with the set of words in that premise. Remember, these are, at the time, the very best systems for solving these problems, and these are very simple, friendly adversaries that they're stumbling over.
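The subject/object swap above is trivially mechanical, which is part of what makes the failure striking. Here is a toy version of the transformation on the lecture's example; a real implementation would use a syntactic parser rather than string substitution, and this helper is purely illustrative, not from the Nie et al. paper.

```python
def swap_subject_object(premise: str, subject: str, obj: str) -> str:
    """Swap two noun strings in a premise to build the adversarial version.

    A bag-of-words model sees the same word set before and after the swap,
    which is why systems kept predicting entailment as described above.
    """
    placeholder = "\x00"  # temporary marker so the two replacements don't collide
    return (premise.replace(subject, placeholder)
                   .replace(obj, subject)
                   .replace(placeholder, obj))

original = "A woman is pulling a child on a sled in the snow."
adversarial = swap_subject_object(original, "woman", "child")
# → "A child is pulling a woman on a sled in the snow."
# Same bag of words, different meaning: the gold label for the hypothesis
# "A child is sitting on a sled in the snow." shifts from entailment to neutral.
```

Note that `set(original.split()) == set(adversarial.split())`: any model that ignores word order literally cannot distinguish the two premises.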
So this could lead you to have two perspectives. I showed you that Nick Bostrom book before, where we worry about superintelligent systems, but on the other hand, we might be living in a world that's more like the one presented in this lovely book from a roboticist and practitioner, Daniel H. Wilson, called How To Survive A Robot Uprising, where he gives all sorts of practical advice like wear clothing that will fool the vision system of the robot, or walk up some stairs, or drench everything in water, very simple adversarial attacks that reveal that these robots are not creatures of which we should be fearful. And I feel like I've just shown you a bunch of examples that are the analogs of wearing misleading clothing in the space of Natural Language Processing, revealing that our systems are not superhuman understanders, or communicators, or anything like that, but rather still, to this day, fairly superficial pattern matchers.
Why is this all so difficult? It's hard to articulate precisely what is so challenging because this is probably deeply embedded in the whole human experience, but I think there are some pretty straightforward superficial things I can show you to just make alive for you how hard even the simplest tasks are.
So here, I've got an imagined dialogue of the sort you would hope Siri would do well with. Where is Black Panther playing in Mountain View? Black Panther is playing at the Century 16 Theater. When is it playing there? It's playing at 2:00, 5:00, and 8:00. OK, I'd like one adult and two children for the first show. How much would that cost? It seems like the most mundane sort of interaction. You would not expect a human to have any problem with any of these utterances, but think about how much interesting stuff is happening in this little dialogue. We have domain knowledge that tells us that this is a place where movies might play and that this is the name of a movie. That's already very difficult. And we have anaphora from the third utterance to the first.
When is it playing there? And I guess also into the second. With these pronouns, you need to figure out what they refer to in the discourse. Then you get this sequence of responses, again with some anaphora back to earlier utterances. And then something really complicated happens here. "I'd like one adult and two children for the first show." "First show" refers back to the sequence of showtimes that was mentioned here, which is very difficult. And "one adult and two children" is not a request for human beings, although that's what the forms would look like, but rather a request for tickets. So somehow, in the context of this discourse, "one adult and two children" refers to tickets and not to people. How much would that cost? That is a kind of complicated event description referring to a hypothetical event of buying some tickets for a particular show. That's the referent of this "that" here-- highly abstract, very difficult at the level of resolving it in the discourse and then figuring out what its actual content is. And this is for the most mundane sort of interaction, to say nothing of the complicated things that, for example, you and I will do when we discuss this material in just a few minutes. So I think this is why we're actually
quite far from the superintelligence that Bostrom was worried about. Here's our perspective. As I said, this is the most exciting moment ever in history for doing NLU. Why? Because there's incredible interest in the problems. Because we are making incredibly fast progress and doing things and solving problems that we never could have even tackled 15 years ago. On the other hand, you do not have the misfortune of having joined the field at its very end. The big problems remain to be solved. So there's a resurgence of interest and explosion of products. The systems are impressive, but their weaknesses make themselves quickly apparent. And when we observe those weaknesses, it's an opportunity for us to figure out what the problem is, and that could lead to the really big breakthroughs. And you all are now joining us on this journey if you haven't begun it already. And for your projects, you'll make some progress along the path of helping us through these very difficult problems the field is confronting, even in the presence of all these exciting breakthroughs. NLU is far from solved. The big breakthroughs lie in the future. So I hope that's inspiring. Now, let me switch gears a little bit and talk about the things that we'll actually be doing in this course to help set you on this journey that we're all on.
So we'll talk about the assignments, the bakeoffs, and the projects. The high-level summary here. Our topics are listed on the left. You can also see this reflected on the website. The one thing that I really do like about this particular plan is that it gives you exposure to a lot of different problems in the field, and also helps you with some tools and techniques that will be really useful no matter what problem you undertake for your final project. The same thing goes for the assignments. We're going to have three assignments, each with an associated bakeoff, which is a kind of competition around data. We're going to talk about word relatedness, cross-domain sentiment analysis, and generating color descriptions. This is a kind of grounded Language Understanding problem. Again, I think those are good choices because they expose you to a lot of different kinds of systems, techniques, model architectures, and so forth. And that should set you up really nicely to do a final project, which has three components, the literature review, an experimental protocol, and then the final paper itself. Our time for this quarter is somewhat compressed, so we'll have to make really good use of the time. But I think we have the schedule that will allow you to meaningfully invest in this preliminary work and still provide you with some space to do these final projects.
Let's talk about the assignments and bakeoffs themselves. So there are three of them. Each assignment culminates in a bakeoff, which is an informal competition in which you enter an original model. The original-model question is part of the assignment: you do something that you think will be fun or interesting, and then the bakeoff essentially involves using that system to make predictions on the held-out test set. The assignments ask you to build baseline systems and then design your original system, as I said. Practically speaking, the way it works is that the assignments earn you 9 of the 10 points, and then you earn your additional point by entering your system into the bakeoff. And the winning bakeoff entries can receive some extra credit. The rationale for all of this, of course, is that we want to exemplify the best practices for doing research in this space and help you do things like incrementally build up a project with baselines and then, finally, an original system. But I should say it should be possible, and it's actually pretty common, for people to take original systems that they developed as part of one of these assignments and use them for their final project. Each one of the assignments is set up specifically to make that kind of thing possible, and productive, and rewarding.
Let me show you briefly what the bakeoffs are going to be like. So the first one is word relatedness. The focus of that unit is on developing vector representations of words. You're going to start probably with big count matrices like the one you see here. This is a word-by-word matrix where each cell gives the number of times that two words (in this case, emoticons) co-occurred with each other in a very large corpus of text. The striking thing about this unit is that there is a lot of information about meaning embedded in these large spaces. You will bend, and twist, and massage these spaces, and maybe bring in your own vector representations or representations you've downloaded from the web, and you will use them to solve a word-relatedness problem.
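As a rough illustration of where such count matrices come from, here is a minimal sketch of building a word-by-word co-occurrence matrix from a toy corpus. The two-sentence corpus and the window size of 2 are made-up placeholders; a real matrix would be built from a very large corpus.

```python
# A sketch of constructing a word-by-word co-occurrence count matrix:
# two words co-occur if they appear within `window` positions of each
# other in the same sentence.

from collections import defaultdict

corpus = [
    "the movie was great great fun",
    "the movie was dull",
]

window = 2  # count words as co-occurring within this many positions
counts = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                counts[w][words[j]] += 1

print(counts["movie"]["was"])  # "movie" and "was" co-occur in both sentences
```

Rows of this matrix then serve as (very raw) vector representations of words, which the unit goes on to reweight and reduce.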
So basically, you'll be given pairs of words like this with a score, and you will develop a system that can make predictions about new pairs. And the idea is to come up with scores that correlate with the held-out scores that we have not distributed, of course, as part of the test set. You'll upload your entry, and we'll give you a score, and then we'll look at what worked and what didn't across all of the systems. And the techniques that we'll explore are many, right? So we'll talk about reweighting, dimensionality reduction, vector comparison. You'll have an opportunity if you wish-- this is a brand new addition to the course-- to bring in BERT if you would like to. So there's lots of inspiring things to try building on the latest stuff that's happening in this space.
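The core prediction step can be sketched very simply. Here, the tiny word vectors are invented for illustration; a real system would use rows of a large (possibly reweighted and dimensionality-reduced) co-occurrence matrix, and the bakeoff would compare predicted scores to held-out human judgments via rank correlation.

```python
# A minimal sketch of the word-relatedness setup: score word pairs by
# the cosine similarity of their vectors. The three toy vectors below
# are hypothetical placeholders, not real corpus counts.

import math

vectors = {
    "coast": [2.0, 5.0, 0.0],
    "shore": [3.0, 4.0, 1.0],
    "forest": [0.0, 1.0, 6.0],
}

def cosine(u, v):
    """Cosine similarity, the standard vector-comparison baseline."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def predict_relatedness(word1, word2):
    return cosine(vectors[word1], vectors[word2])

# Related words should get higher scores than unrelated ones.
print(predict_relatedness("coast", "shore"))   # high
print(predict_relatedness("coast", "forest"))  # low
```

Reweighting schemes, dimensionality reduction, or BERT-derived vectors all slot in by replacing the `vectors` lookup while keeping the same scoring interface.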
The second bakeoff is called cross-domain sentiment analysis. This is a brand new bakeoff. I'm very excited to see how this goes. Here's how this is going to work. We want to be a little bit adversarial with your system. So there are two datasets involved, the Stanford Sentiment Treebank, which is movie review sentences. And we're going to deal with it in its ternary formulation, so it has positive, negative, and neutral labels. That's SST-3. Alongside that, I'm going to introduce a brand new dev/test split previously unreleased, which is sentences from the Restaurant Review domain.
It has the same kind of labels but, of course, it's very different from the Movie Review domain along many dimensions. So for the bakeoff, you'll have the SST-3 train set. We're going to give that to you, and you are welcome to introduce any other data you would like to introduce as part of this training. That is entirely up to you. We're also distributing, for this bakeoff, two dev sets, the SST-3 dev set, which is public already, and this brand new one of Restaurant Review Sentences. And you can introduce other development sets if you want as part of tuning your system. And then the bakeoff will be conducted as the best that people can do jointly on SST-3 and this new test set, which is, again, held-out. The idea here is that you'll not only be doing a really great kind of project in the classification thing involving sentiment, but also, pushing your systems to adapt into new domains from the ones that they were trained on. Although, of course, part of that could be training in clever ways that do help you anticipate what's in this Restaurant Review data. And then the third bakeoff is a Natural Language Generation
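The evaluation setup can be sketched as follows. This assumes a trained classifier exposed as a `predict` function; the tiny datasets, the trivial majority-class model, and the simple accuracy metric are all illustrative stand-ins, since the actual bakeoff scoring may use a different metric such as macro-F1.

```python
# A sketch of the cross-domain evaluation loop: one model, trained on
# SST-3, is scored jointly on an in-domain dev set and on the new
# restaurant-review dev set.

def accuracy(model_predict, examples):
    """Fraction of (text, label) pairs the model labels correctly."""
    correct = sum(1 for text, label in examples if model_predict(text) == label)
    return correct / len(examples)

# Hypothetical trivial baseline: predicts "positive" for everything.
def majority_model(text):
    return "positive"

sst3_dev = [("a gripping film", "positive"),
            ("tedious and dull", "negative")]
restaurant_dev = [("the tacos were amazing", "positive"),
                  ("service was slow", "negative"),
                  ("it's a restaurant", "neutral")]

# Tune against both dev sets jointly, since the bakeoff scores
# performance on SST-3 and the new restaurant domain together.
joint = (accuracy(majority_model, sst3_dev)
         + accuracy(majority_model, restaurant_dev)) / 2
print(joint)
```

The point of the joint score is that a model overfit to movie-review language gets penalized on the restaurant domain, which is what pushes you toward domain adaptation strategies.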
task, but it's a grounded Natural Language Generation task. So for this bakeoff, you'll be given a color context: three patches like this, one of which is designated as the target. For the training data, we had people describe the target color in this context, and your task is to develop a system that can perform that same task: produce natural language utterances like this. I think this is a really cool problem because it's grounded in something outside of language, right? You're grounded in a color patch; it's just a numerical representation. And it's also highly context-dependent, in that the choices people make for their utterances depend not only on the target-- the color they need to describe-- but on the target in the context of these other two colors. And you can see that in these descriptions. This one is easy, so the person just said "blue." But in this case, since there are two blues, they said "the darker blue one," keying into the fact that the context contains two blue colors. This is an even more extreme case of that: "dull pink, not the super bright one." The implicit reference is not only to the target but also to the other colors that are distractors in that context. And then the final two are really extreme cases of this, where for one and the same color in terms of its color representation, this person said "purple" and this person said "blue," in virtue of the fact
that the distractors are different. So we have this kind of controlled Natural Language Generation task that we can see is highly dependent on the particular context that people are in here. And again, this will follow the same path here. I'm going to give you the whole model architecture as a kind of default. It's an encoder-decoder architecture where you'll have a machine learning system that consumes color representations and then transfers that into a decoding phase where it tries to produce an utterance. So here, you consumed a bunch of colors and produced the description light blue, but, of course, you'll be able to explore many variants of this architecture and really explore different things that are effective in terms of this representation and Natural Language Generation task. And again, as a bakeoff, it will work the same way. You'll do a bunch of development on training data that we give you, and then you'll be evaluated on the held-out test set that was produced in the same fashion but involves entirely new colors and utterances.
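A skeletal version of that default architecture might look like the following, assuming PyTorch. All of the dimensions, the vocabulary size, and the module names here are illustrative placeholders, not the course's actual starter code.

```python
# A sketch of an encoder-decoder color describer: a GRU encoder reads
# the three color vectors, and its final hidden state conditions a GRU
# decoder that produces the utterance token by token.

import torch
import torch.nn as nn

class ColorDescriber(nn.Module):
    def __init__(self, color_dim=3, hidden_dim=16, vocab_size=10, embed_dim=8):
        super().__init__()
        # Encoder: consumes the sequence of color representations.
        self.encoder = nn.GRU(color_dim, hidden_dim, batch_first=True)
        # Decoder: generates the description conditioned on the encoding.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, colors, tokens):
        # colors: (batch, 3, color_dim); tokens: (batch, seq_len)
        _, h = self.encoder(colors)      # final hidden state summarizes the context
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)         # logits over the vocabulary at each step

model = ColorDescriber()
colors = torch.rand(2, 3, 3)             # batch of 2 contexts, 3 colors each
tokens = torch.randint(0, 10, (2, 4))    # teacher-forced token inputs
logits = model(colors, tokens)
print(logits.shape)                      # (batch, seq_len, vocab_size)
```

Crucially, because the encoder sees all three colors and not just the target, the decoder can in principle learn context-sensitive choices like "the darker blue one."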
Quick note on the original systems. As I said before, the original system is kind of a central piece of each one of the assignments. The homeworks really culminate in this original system, and that becomes your bakeoff entry. In terms of grading it, this is kind of hard, because we want you to be creative and try lots of things. So the way we're going to value these entries is any system that performs extremely well on the bakeoff will be given full credit, even systems that are very simple, right? We can't argue with success according to the criteria that we've set up. So if the simplest possible approach to one of these bakeoffs turns out to be astoundingly good and you had to do almost no work to succeed, you, of course, get full credit. But that's not the only thing we value, right? So systems that are creative and well-motivated will be given full credit even if they don't perform well on the bakeoff data. This is meant to be an explicit encouragement for you to try new things, to be bold, and be creative even if it doesn't numerically lead to the best results. In fact, some of the most inspiring things we've seen and insightful things we've seen as part of these bakeoffs, have been systems that didn't perform at the top of the heap, but harbored some really interesting insight that we could build on. And then, of course, systems that really are minimal, if you do very little and you don't do especially well at the bakeoff, will receive less credit. Specific criteria will depend on the nature of the assignment
and so forth, and we'll try to justify that for you. This is the more subjective part. I think 1 and 2 really encode the positive part of the kind of values that we're trying to convey to you as part of these original system entries.
And then you'll have project work. This occupies the entire second half of the course. At that point, the lectures, the notebooks, the readings, and so forth are really focused on things like methods, metrics, best practices, error analysis, model introspection, and other things that will help you enrich the paper that you write. The assignments are all project-related-- the literature review, the experimental protocol, and the final paper. In many years, we've
had a video presentation, which has always been really rewarding. But I feel like, given the compressed time schedule that we're on, we just don't have time for even short videos, so we're going to focus on these three crucial components. And then for exceptional final projects from past years that we've selected, you can follow this link. It's access-restricted, but if you're enrolled in the course, you should be able to follow the link and see some examples. And there's a lot more guidance on final projects in our course repository. I have a very long write-up of FAQs and other guidance
about publishing in the field and writing for this class. And I have what is now a really inspiringly long list of published papers that have grown out of work people have done for this course, so you can check that all out here. Final words here by way of wrapping up. As I said, this is the most exciting moment ever in history for doing NLU. This course will give you hands-on experience with a wide range of challenging problems. I emphasize the hands-on thing. I think this is so important. If you want to acquire a new skill like this, it's all well and good to watch other people doing it, but the way you really acquire the skill is by having hands-on experiences yourself. So everything about the requirements in the