The Basics of Language Modeling with Transformers: How do we Measure Natural Language Understanding?

Viren Bajaj
November 12, 2021
[Image: Two robots. One speaks a bunch of 0s and 1s. The other one says, "I know right?"]


Modeling natural language has become one of the flagship successes of deep learning techniques. We see these techniques deployed at scale in many software systems, including the one used to draft this article: Google Docs. With every additional letter typed, Google Docs tries to predict what is about to come next - you can say we complete each other's sentences.

The program making this recommendation is calculating the probability of the next words in the sentence given the ones already typed. It is based on a language model called the transformer. But before we dive into how a transformer works, let's get some context about the field of natural language understanding (NLU).

I believe a good way to begin to understand any system, including language models, is to examine what we expect the system to do: how can one measure whether a system understands language? Concretely discussing the inputs and outputs of the tasks that such systems are tested on gives us an anchor point for discussing the methods used to achieve them. The task that helps evaluate the effectiveness of a recommender in Google Docs, for example, is known simply as sentence completion: the input is a sequence of tokens (e.g., characters or words) and the output is the set of predicted tokens that correctly complete the phrase. In Part I of this series of blog posts, I will discuss how we measure natural language understanding.
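Before looking at how transformers do this, it may help to see the simplest possible version of next-token prediction. The sketch below estimates the probability of the next word from bigram counts over a tiny corpus; the corpus and function names are illustrative, and a real system would use a far richer model than raw counts.

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for text a user has already typed.
corpus = "we complete each other 's sentences and we complete each form".split()

# Count bigrams: how often does each word follow a given word?
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(word):
    """Estimate P(next | word) from bigram counts."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("complete"))  # {'each': 1.0}
print(next_word_probs("each"))      # {'other': 0.5, 'form': 0.5}
```

A recommender built this way would suggest the highest-probability continuation; a transformer plays the same role, but conditions on the entire preceding sequence rather than just the last word.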

Measuring NLU systems boils down to creating tests that try to capture what we think it means to understand a language spoken by humans. As one can imagine, there are many ways to interpret what it means to understand language, and the scientific community has devised a variety of tests accordingly. In this article, I discuss a few of the prominent ones to convey the flavor of what it means to create such tests. These are:

  1. Natural Language Inference (NLI)
  2. Question Answering (QA)
  3. Textual Similarity
  4. Classification

Natural Language Inference (NLI)

NLI consists of inferring whether a hypothesis is true, undetermined, or false given a prompt that forms the premise. For example, given the premise "Two caps and a postcard hang on the wall", the hypothesis "There is a cap on the wall" is true, "A postcard with a painting of the countryside hangs on the wall" is undetermined, and "There are three caps hanging on the wall" is false. Algorithms must take in a premise and a hypothesis and output a label: true, undetermined, or false. More formally, a true hypothesis is said to be entailed by the premise (entailment), an undetermined one is called neutral, and a false one is called a contradiction.
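The input/output format of the task can be made concrete with a small sketch. The data structure and label strings below are illustrative (real datasets vary in their exact schema), using the examples from the paragraph above:

```python
from typing import NamedTuple

# Hypothetical schema for an NLI example; field and label names
# are illustrative, not taken from any particular dataset.
class NLIExample(NamedTuple):
    premise: str
    hypothesis: str
    label: str  # "entailment" | "neutral" | "contradiction"

examples = [
    NLIExample("Two caps and a postcard hang on the wall",
               "There is a cap on the wall",
               "entailment"),
    NLIExample("Two caps and a postcard hang on the wall",
               "A postcard with a painting of the countryside hangs on the wall",
               "neutral"),
    NLIExample("Two caps and a postcard hang on the wall",
               "There are three caps hanging on the wall",
               "contradiction"),
]

# An NLI model is any function (premise, hypothesis) -> label.
labels = {ex.label for ex in examples}
print(labels)
```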

You might think that one could construct tricky examples in which a hypothesis does not fall clearly into one of the three categories. Indeed, such cases are unavoidable, because natural language is inherently ambiguous. This is why, when creating datasets, several annotators are often employed to label each premise-hypothesis pair, and a vote is taken to decide the final label.
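The voting step is simple to sketch. Below, the annotator labels are hypothetical; real dataset pipelines typically also record whether a clear majority was reached, discarding pairs where annotators could not agree:

```python
from collections import Counter

# Hypothetical labels from five annotators for one premise-hypothesis pair.
votes = ["neutral", "entailment", "neutral", "neutral", "contradiction"]

def majority_label(votes):
    """Return the most frequent label among annotator votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label

print(majority_label(votes))  # neutral
```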

Some popular natural language inference datasets are the Stanford Natural Language Inference (SNLI) corpus and its multi-genre successor, MultiNLI.

Question Answering

Question answering is a task that can be presented in many formats. One of them is the multiple-choice format, which consists of a context prompt, a question, and several candidate answers of which exactly one is correct. An algorithm must take as input the context, the question, and the candidate answers, and output the correct answer. The difficulty of the data depends on whether the candidate answers reuse verbiage from the context prompt, which makes predicting the answer easy, or paraphrase it, which implies the algorithm must 'understand' in a deeper sense what the answer means. The most prominent multiple-choice question answering dataset is RACE: Large-scale ReAding Comprehension Dataset From Examinations.
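The point about verbatim versus paraphrased answers can be illustrated with a crude baseline: pick the candidate that shares the most words with the context. The context and candidates below are made up for illustration. Such a baseline does well exactly when answers reuse the context's wording, and fails once answers are paraphrased, which is why paraphrased datasets are considered harder.

```python
def overlap_score(context, answer):
    """Count distinct words the candidate answer shares with the context."""
    return len(set(context.lower().split()) & set(answer.lower().split()))

def pick_answer(context, candidates):
    """Naive baseline: choose the candidate with maximum word overlap."""
    return max(candidates, key=lambda a: overlap_score(context, a))

context = "The museum opens at nine and closes at five on weekdays."
candidates = ["It opens at nine", "It opens at noon", "It never opens"]
print(pick_answer(context, candidates))  # It opens at nine
```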

Another popular format for question-answering data is the cloze style. This is a common-sense reasoning task in which the program has to predict the correct continuation of a fixed number of contiguous sentences. For example, in the Story Cloze Test, the system has to choose the correct ending to a four-sentence story from two alternative final sentences. The correct ending can be thought of as an entailment (recall the NLI task) and the incorrect one as a contradiction.

Textual Similarity

Textual similarity is a task that requires predicting how similar in meaning two bodies of text are. It consists of either giving a score between 1 and 5 for how semantically similar two texts are, or deciding whether two texts are paraphrases (or 'duplicates') of each other.
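As a minimal sketch of what a similarity scorer's interface looks like, the function below uses Jaccard overlap between word sets and rescales it onto a 1-5 range. This is a deliberately crude surrogate (real systems compare meanings, not word sets, so paraphrases with no shared words would score 1 here), but it shows the input/output shape of the task:

```python
def jaccard(a, b):
    """Jaccard similarity between two texts' word sets, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def similarity_score(a, b):
    """Map word overlap onto the 1-5 scale used by semantic
    similarity benchmarks (a crude illustrative surrogate)."""
    return 1 + 4 * jaccard(a, b)

s = similarity_score("how do i learn python", "how do i learn java")
print(round(s, 2))  # 3.67
```

A paraphrase-detection variant of the task would instead threshold such a score to output a binary duplicate/not-duplicate label.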

The most prominent dataset for this task is Quora Question Pairs, which contains 400,000 pairs of questions labelled as duplicates or not. Similarly, the Microsoft Research Paraphrase Corpus (MRPC) consists of 5,800 sentence pairs labelled as to whether they are paraphrases of each other.

There also exists a framework for evaluating the quality of a given sentence representation (embedding) called SentEval. It includes a set of tasks selected based on "what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations", which include similarity, classification, and NLI.


Classification

Classification is a term that subsumes many tasks in which bodies of text are sorted into different categories. The categories can be based on sentiment, in which the system has to decide whether a body of text expresses a positive, neutral, or negative sentiment. Note that a dataset and system can be designed to have more granular sentiment categories. The Stanford Sentiment Treebank (SST) is a dataset that uses a 1-5 scale to classify movie reviews from most negative to most positive. Another classification task is deciding linguistic acceptability, i.e., whether a sentence is grammatically correct or not. A dataset for this task is the Corpus of Linguistic Acceptability (CoLA).
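To make the sentiment variant concrete, here is a toy lexicon-based classifier. The word list and weights are invented for illustration; modern systems learn these associations from data rather than using a hand-made lexicon, and lexicon methods famously stumble on negation ("not great") and sarcasm.

```python
# Tiny hand-made sentiment lexicon; words and weights are illustrative.
LEXICON = {"great": 1, "wonderful": 1, "boring": -1, "terrible": -1}

def classify(text):
    """Sum word polarities and map the total to a coarse sentiment label."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("a great and wonderful film"))  # positive
print(classify("a boring plot"))               # negative
```

A more granular scheme, like SST's 1-5 scale, would map the score onto more buckets instead of three.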


Summary

Some important tasks used to test modern deep learning systems that model natural language are natural language inference (NLI), question answering (QA), textual similarity, and classification. I discussed what these tasks mean and the most popular datasets used to test the performance of natural language models on them.

Now that we have a flavor of the tasks NLU systems such as the transformer are trying to perform, let us explore in the next post how such state-of-the-art systems came to be.