Duolingo: Partnering with OpenAI for More Personalized Language Learning

By Steven Melendez | July 7, 2023

Duolingo, the Pittsburgh-based language-learning tech company, says its users answer a total of 1.25 billion quiz questions on its apps every day. The company has been using an AI system called Birdbrain to deliver questions to users based on their skill levels, and it has also recently integrated OpenAI’s GPT-4 technology into its premium Duolingo Max offering to provide new interactive features, such as role-playing scenarios for practicing language use. Klinton Bicknell, head of AI at Duolingo, spoke about what the company has learned from both in-house AI developments and work with GPT-4.

Can you tell me a bit about the work you’ve been doing in AI and how that fits into the larger picture at the company?

Klinton Bicknell, Head of AI at Duolingo

Duolingo has been investing in AI for quite a while, starting in early days with models predicting what words people knew, and things like that, from the data.

These days, we’re using machine learning and AI throughout most of the Duolingo products. We use it for a lot of interactive features. The new Duolingo Max stuff is one of the things that fall into this bucket, using GPT-4.

We also use a lot of it for personalization of the whole experience. The Birdbrain model is one of the things that falls into that bucket. We’re using AI in helping create a lot of our content in collaboration with our in-house experts. And in the speech area, we do a lot of things: recognizing learner speech, and generating speech in the voice of our different Duolingo world characters.

The Duolingo English certification uses AI throughout the test in helping generate questions, grading questions, administering the test, and in the security space.

Lately, we were very excited to be early partners with OpenAI on GPT-4, which really advances what you can do with AI in a lot of particular domains, especially relating to interactive language use cases. One of [the first features we have created] is the Explain My Answer feature. There’s a common problem where people get something wrong, but don’t know why their answer was wrong. And in many cases, you’re trying to translate something. There are an infinite number of possible answers people could have. You can’t just pre-write the explanation for all the different wrong answers, so you really need AI to do that. And GPT-4 allowed us to finally do that, and in keeping with Duolingo style — not using too much grammatical jargon…

The other new Max feature is Roleplay, which is basically a way of having a conversation with an AI for a particular scenario. For example, you’re ordering a coffee in a Parisian cafe, and you have the whole dialogue and can kind of take it in different directions, what kind of drink you want, and you can get to chat a little bit with the barista as you’re having this conversation, but being able to practice that kind of interactive back-and-forth for a particular goal. Again, we’re using our Duolingo world cast of characters who you are interacting with, so the barista is, maybe, Lily, our kind of rebel teenage personality.

How did that partnership with OpenAI come about?

Basically, they gave us access to the [GPT-4] model back in September, actually before ChatGPT had even come out. It really blew us away initially, because its capabilities were just far beyond what existing models could do. As soon as we saw it, we immediately said, “Alright, well, we need to have a team that is working on building features around this, because this really changes the game of what we can offer.”

The first two features that we came up with were what ultimately became Explain My Answer and Roleplay, two ideas that we thought were promising, very useful use cases, things we’ve been looking to offer for a while but couldn’t — things that get us closer to having all the nice properties of a personal tutor. But also, critically, [they were] things that we thought wouldn’t need a long development time, because we wanted to get this stuff out relatively quickly.

In that process between September and when we initially softly rolled these things out to some users, in early February or so, we were working closely with OpenAI, iterating on our end. We had weekly meetings with them talking about pain points. We also helped provide some training data to them to actually help them improve the GPT-4 model for these kinds of use cases. So it was really quite a useful collaboration. We talked a bit about preventing things from going wrong in different ways.

One interesting thing there is that this generation of technology is so good at following really complicated instructions really well. Still not perfectly, but really well. So building the initial prototype for Explain My Answer was done in something like a day. You just ask the model, “Hey, here’s a learner, they’re learning this language. What was the mistake they made? Explain why it’s wrong. And don’t use complicated words.” You can say something like that, and it will generally do something more or less like what you want.
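To make the prototype idea concrete, here is a minimal sketch of how such an instruction might be assembled in code. It is an illustration only and not Duolingo's actual prompt: the function name `build_explanation_prompt`, its parameters, and the example sentences are all assumptions, and the sketch stops at building the text rather than calling any model.

```python
def build_explanation_prompt(language, prompt_sentence, learner_answer, accepted_answer):
    """Assemble a plain-text instruction for a large language model.

    Hypothetical helper: the wording paraphrases the instruction quoted in the
    interview; the real prompt used in the product is not public.
    """
    return (
        f"A learner is studying {language}.\n"
        f"They were asked to translate: {prompt_sentence!r}\n"
        f"They answered: {learner_answer!r}\n"
        f"One accepted answer is: {accepted_answer!r}\n"
        "What mistake did they make? Explain why it is wrong.\n"
        "Do not use complicated words or grammatical jargon, and keep an encouraging tone."
    )


# Example inputs are invented for illustration.
prompt = build_explanation_prompt(
    "French",
    "I would like a coffee",
    "Je veux un cafe",
    "Je voudrais un café",
)
print(prompt)
```

The resulting string would then be sent to whichever model is in use; that step is left out here because it depends on the provider and client library.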

But then a lot of the iteration came to figure out how to go from a prototype that worked mostly right 70 percent of the time, to something pretty reliably producing good responses that we were confident about and that were fun and engaging. The hallucination problem, I’m sure you’ve heard about: sometimes it would give a completely made-up reason why your answer was wrong that just has no basis in reality.

What did that process of iteration look like?

There were a lot of different pieces to that. One part of it was just gathering data, getting it to explain a lot of answers, and getting someone who actually knew the language to say, “Yeah, that’s a good explanation,” or, “Nope, this is wrong.” Factual accuracy was obviously one very important [dimension], but also things like avoiding too much grammatical jargon. Another one [was] being friendly, and not making the person feel too bad about their answer, doing it in an encouraging way.
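A review loop like this can be tracked with very simple bookkeeping. The sketch below assumes each human review is recorded as three yes/no judgments matching the dimensions mentioned in the answer (accuracy, jargon, tone); the field names, the `summarize_reviews` helper, and the sample data are hypothetical and not taken from Duolingo's tooling.

```python
from collections import Counter

# Hypothetical rating dimensions, named after the criteria in the interview.
DIMENSIONS = ("accurate", "jargon_free", "encouraging")


def summarize_reviews(reviews):
    """Return the share of explanations passing each dimension, and all of them."""
    if not reviews:
        raise ValueError("need at least one review")
    passed = Counter()
    for review in reviews:
        for dim in DIMENSIONS:
            if review[dim]:
                passed[dim] += 1
        # An explanation only counts as good if it passes every dimension.
        if all(review[dim] for dim in DIMENSIONS):
            passed["all"] += 1
    total = len(reviews)
    return {key: passed[key] / total for key in (*DIMENSIONS, "all")}


# Invented sample: four explanations judged by a reviewer who knows the language.
reviews = [
    {"accurate": True, "jargon_free": True, "encouraging": True},
    {"accurate": False, "jargon_free": True, "encouraging": True},
    {"accurate": True, "jargon_free": False, "encouraging": True},
    {"accurate": True, "jargon_free": True, "encouraging": True},
]
summary = summarize_reviews(reviews)
print(summary)
```

Tracking the combined "all" rate alongside the per-dimension rates shows whether a prompt change that improves one criterion is hurting another.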

Now that it’s actually available to the public, what are you observing, and how are you testing and measuring to make sure it’s working?

It’s still very early days in terms of hard data. I think we’re not sharing any numbers at this point. We’re still in pretty limited rollout, but anecdotally we’ve definitely seen a lot of enthusiasm for these features.

One other aspect of the Roleplay that I didn’t mention earlier: At the end, after you have your conversation and you get your drink or whatever and you pay, then the narrator pops in to tell you some tips on how to improve next time you do this kind of experience, like, “You said this when you’re ordering a drink, but it actually would have been a little more natural to say it this way.” There’s been especially a lot of enthusiasm for that, where it’s not just having the conversation, but this kind of consolidation at the end.

At a higher level, we are really excited about a lot of different use cases for GPT-4 across the company. I think these two are really just scratching the surface of what we can do there. So we have a number of other teams looking at other things that we can do with this technology. A lot of them are other types of interactive features, things with deep levels of interaction that just weren’t possible up until this kind of technology. We are also exploring incorporating this model into a lot of our content generation processes, which should allow our courses to cover even more advanced topics and have more variety of exercises. Right now, when we create course content, we really try to make sure everything is up to a very high quality bar. It takes a while, it’s expensive, and it takes a lot of work, so incorporating these sorts of models can really help to speed up that process by a lot. And we’re already seeing a lot of gains in terms of how much this can automatically generate.

And then, separately, historically, a lot of the AI work that we’ve done at Duolingo has been about personalization: using our giant pile of data about all of our learners, who do about 1.25 billion exercises every day on Duolingo, to figure out how to optimize people’s learning experiences, both overall and for each person given their history. Out of the box, GPT-4 doesn’t know anything about that. It’s trained on language data on the internet, essentially, not how people learn in a particular environment like this. So we are also actively exploring ways of incorporating either these models or technologies like these models into our personalization, to really make the lessons and everything even more tied to exactly what you’re trying to learn, and exactly where you’re at.

Is there a big difference in your processes working with external models like GPT versus building models in house?

It is definitely different in a lot of ways. I think maybe one of the most salient ways is actually that these large language models like GPT-4 enable people who have no expertise in AI to, essentially, do AI. You can build an AI feature on top of one of these without really having any idea how anything works under the hood. You can ask the model, “Hey, do X,” and it might do X pretty well, so it’s been very interesting thinking about how that democratizes access to AI.

Historically, a team shipping AI features like Max would involve a lot of AI experts together with engineers, designers, product people, learning scientists, and all the other people who are useful to ship a good feature. But for these features, we really had only a handful of AI experts.

So it is interesting that problems are bifurcating into cases where you need expertise and cases where you don’t. And most but not all of that distinction is about how much you need to be using your own data for this, versus just the data that the models were already trained on.

What did the process of putting together personalization look like before GPT-4?

A Duolingo image explaining Birdbrain’s approach to personalizing the learning experience.

Our biggest personalization model is Birdbrain. This is a model that updates every day based on 1.25 billion exercises that learners do, to basically make predictions about what all those learners know. And then we use those predictions to figure out the right exercises to show this person that are going to be at their level and maximally effective at teaching them and also keeping them engaged. Things that are not going to be too hard and discouraging, but also not too easy and boring for them. So we’ve been building Birdbrain for years. That is currently using a neural network technology that is not quite the same as what GPT-4 is using, but it’s sort of related. And now we’re doing a lot of research into exactly how you combine these things in the most useful way.
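The selection step described here, going from predictions about what a learner knows to exercises that are neither too hard nor too easy, can be illustrated with a toy example. The interview does not describe how Birdbrain actually does this, so everything below is an assumption: the per-exercise probabilities, the `pick_exercises` helper, and especially the target success rate of 0.8 are invented for illustration.

```python
def pick_exercises(predicted_correct, target=0.8, k=2):
    """Pick the k exercises whose predicted success rate is closest to the target.

    predicted_correct maps an exercise name to the (hypothetical) probability
    that this learner answers it correctly. A target below 1.0 avoids exercises
    that are too easy; a target well above 0.0 avoids ones that are too hard.
    """
    ranked = sorted(
        predicted_correct,
        key=lambda name: abs(predicted_correct[name] - target),
    )
    return ranked[:k]


# Invented predictions for one learner.
predictions = {
    "greetings": 0.98,    # almost certainly known: likely boring
    "past_tense": 0.55,
    "food_vocab": 0.82,
    "subjunctive": 0.20,  # likely discouraging
    "plurals": 0.74,
}
chosen = pick_exercises(predictions)
print(chosen)
```

A production system would also have to balance engagement, review scheduling, and variety, which this sketch ignores.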

Another direction that we are going in is using models like GPT-4 to do things in a lightweight way: summarize the last 20 mistakes this person made, or the last 100 mistakes that they made, what are they? What are the concepts that they’re struggling with? It will be able to give a language-based answer: this person is a little confused about how the past tense works or something.

And then we can go from that to directly generating some exercises targeting that, or basically matching on the closest exercises that we already have focusing on that topic. So that’s another way that we’re thinking about using those models. That approach actually wouldn’t require training the model on our own data. It’s that final step of matching it up to our own data and using our mistake data as input.
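The matching step can be sketched without any model training, which is the point made above. The toy example below assumes the language model has already produced a short text summary of a learner's mistakes, and matches it to existing exercises by simple word overlap (Jaccard similarity). The exercise IDs, topic descriptions, and the `closest_exercises` helper are hypothetical, and word overlap stands in for whatever matching method would really be used.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(",", " ").split())


def closest_exercises(summary, exercise_topics, k=1):
    """Rank exercises by word overlap between the summary and each topic."""
    summary_words = tokenize(summary)

    def jaccard(topic):
        topic_words = tokenize(topic)
        union = summary_words | topic_words
        # Guard against two empty texts, which would otherwise divide by zero.
        return len(summary_words & topic_words) / len(union) if union else 0.0

    ranked = sorted(
        exercise_topics,
        key=lambda ex_id: jaccard(exercise_topics[ex_id]),
        reverse=True,
    )
    return ranked[:k]


# Invented model output and exercise catalog.
summary = "confused about how the past tense works"
exercises = {
    "ex_101": "past tense of regular verbs",
    "ex_202": "ordering food and drinks",
    "ex_303": "plural nouns and articles",
}
matches = closest_exercises(summary, exercises)
print(matches)
```

Only the final lookup touches the in-house exercise data; the summary itself comes from the general-purpose model given the learner's recent mistakes as input.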

Anything else you want to share about where AI is headed at Duolingo?

I think that all of these recent advances in AI have just really changed the nature of what’s possible. And I think that Duolingo in particular, by virtue of having our very large amount of data on how people learn, is in a really exciting position right now…combining all of our data with the latest technology. I think that in the next year or two, we’re really going to see the nature of that kind of personalized education in your pocket change pretty dramatically. As a result, we’re going to be able to really teach people, not just language, but also other subjects, much better and in a much more engaging way.

Key insights…
• Lots of iteration is required to create a generative AI experience that feels engaging (and doesn’t make things up).

• Creating true AI competencies will involve building models internally as well as leveraging the best external models.

• Not all AI opportunities and use cases require a team of experts. AI projects may be bifurcating into those where you need deep expertise, and those where more “general purpose” technologists can deliver impressive results.