The Basics of Language Modeling with Transformers: GPT

Viren Bajaj
November 14, 2021


OpenAI's GPT is a transformer-based language model introduced in the 2018 paper “Improving Language Understanding by Generative Pre-Training” by Radford et al. It was highly successful at the time: the model is first pre-trained in an unsupervised way on a large corpus and then fine-tuned for different downstream tasks. This approach of task-agnostic pre-training followed by fine-tuning contrasts with training task-specific models, which had previously achieved state-of-the-art performance.

Unsupervised Pre-Training

During pre-training, a neural network is trained to maximize the likelihood of the next token given the previous k tokens. Concretely, given an unlabeled corpus U = {u₁, u₂, …, uₙ}, the likelihood L₁(U):

L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ, …, uᵢ₋₁; 𝚯)

Standard language modeling objective used in unsupervised pre-training

is maximized using stochastic gradient descent over the parameters 𝚯 of a transformer decoder (shown below), which models the conditional probability P.

Transformer architecture used in GPT during pre-training
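The language modeling objective above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `next_token_logits` is a hypothetical stand-in for the transformer decoder, the toy `uniform_model` and the 4-token vocabulary are invented for the example, and the sum starts at the second token because the first one has no context.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_log_likelihood(tokens, next_token_logits, k):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over a token sequence.

    `next_token_logits(context)` is a hypothetical stand-in for the
    transformer decoder: it maps a list of token ids to one logit per
    vocabulary entry.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]  # at most k previous tokens
        probs = softmax(next_token_logits(context))
        total += math.log(probs[tokens[i]])
    return total

# Toy stand-in model (assumption): uniform over a 4-token vocabulary.
def uniform_model(context):
    return [0.0, 0.0, 0.0, 0.0]

corpus = [0, 2, 1, 3, 2]
print(lm_log_likelihood(corpus, uniform_model, k=3))
```

In actual training the decoder would replace `uniform_model`, and gradient-based optimization would adjust its parameters to make this sum as large as possible.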

Supervised Fine Tuning

Objective Functions

Once the transformer has been pre-trained, a new linear (fully connected) layer is attached to its output, and the result is passed through a softmax function to produce the output required for the specific task, such as natural language inference, question answering, document similarity, or classification. The model is fine-tuned on each of these supervised tasks using labelled datasets. The supervised objective function over a labelled dataset C, in which each data point is a token sequence x = (x¹, …, xᵐ) with label y, then becomes L₂(C):

L₂(C) = Σ log P(y | x¹, …, xᵐ), summed over all (x, y) in C

Supervised fine-tuning objective function used in GPT
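As a rough sketch of the supervised objective, the snippet below applies a linear layer and a softmax to a feature vector and sums the log-probabilities of the correct labels. The names `classify` and `supervised_log_likelihood`, the list-of-rows weight matrix `W`, and the toy data are assumptions made for illustration; in the paper the feature vector is the final transformer block's activation for the last input token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_final, W):
    """P(y | x) = softmax(h W).

    h_final: transformer output for the sequence (list of d floats).
    W: weights of the added linear layer, d rows by num_classes columns.
    """
    num_classes = len(W[0])
    logits = [
        sum(h_final[d] * W[d][c] for d in range(len(h_final)))
        for c in range(num_classes)
    ]
    return softmax(logits)

def supervised_log_likelihood(dataset, W):
    """L2(C): sum of log P(y | x) over (h_final, y) pairs."""
    return sum(math.log(classify(h, W)[y]) for h, y in dataset)

# Toy data (assumption): 2-dimensional features, 3 classes, zero weights.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
dataset = [([1.0, -1.0], 0), ([0.5, 2.0], 2)]
print(supervised_log_likelihood(dataset, W))
```

With all-zero weights every class is equally likely, so the printed value is just the log-likelihood of a uniform guess; fine-tuning would update both `W` and the transformer parameters to raise it.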

Sometimes, to improve performance, the language modeling objective L₁(C) is added to the fine-tuning objective as an auxiliary term weighted by λ, giving the final objective L₃(C):

L₃(C) = L₂(C) + λ · L₁(C)

Final objective during supervised fine-tuning for certain tasks
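The combined objective is a simple weighted sum, as the short sketch below shows. The function name `combined_objective` and the numeric inputs are made up for the example, and the default weight of 0.5 is only an illustrative choice.

```python
def combined_objective(l2, l1, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C).

    l2: supervised log-likelihood on the labelled dataset.
    l1: language modeling log-likelihood on the same text.
    lam: weight of the auxiliary language modeling term (assumed value).
    """
    return l2 + lam * l1

# Made-up log-likelihood values, purely for illustration.
print(combined_objective(-2.0, -6.0))           # -5.0
print(combined_objective(-2.0, -6.0, lam=0.0))  # -2.0, auxiliary term switched off
```

Setting `lam` to zero recovers plain supervised fine-tuning, which makes it easy to compare runs with and without the auxiliary term.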

Input Transformations

Input transformations for different supervised fine-tuning tasks

Common to all tasks is that the input text is sandwiched between randomly initialized start and end tokens.

In classification, the input text is simply wrapped in the start and end tokens; no further transformation is needed.

In Natural language inference (Entailment), the premise and hypothesis are separated by the delimiter token ($).

In (text) similarity, the two texts are separated by the delimiter token ($) and passed through the transformer once, and then a second time with the order of the texts swapped. The two output representations are then added element-wise and passed through the linear+softmax layer. Both orderings are used because a similarity task has no inherent order between the two texts.
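A minimal sketch of the similarity setup follows. It assumes the combination step described in the GPT paper, where the two sequence representations are added element-wise before the linear layer. The `encode` argument is a hypothetical stand-in for the transformer, `toy_encode` is an invented order-sensitive function used only to make the example run, and the token spellings `<s>` and `<e>` are placeholders.

```python
def similarity_features(encode, text_a, text_b,
                        start="<s>", delim="$", end="<e>"):
    """Encode both orderings of the pair and add the results element-wise."""
    h_ab = encode([start] + text_a + [delim] + text_b + [end])
    h_ba = encode([start] + text_b + [delim] + text_a + [end])
    return [p + q for p, q in zip(h_ab, h_ba)]

def toy_encode(tokens):
    """Hypothetical order-sensitive encoder, for illustration only:
    position-weighted token lengths and the token count."""
    weighted = sum((i + 1) * len(t) for i, t in enumerate(tokens))
    return [float(weighted), float(len(tokens))]

a = ["a", "cat"]
b = ["the", "feline", "sleeps"]
# The combined features do not depend on which text is passed first.
print(similarity_features(toy_encode, a, b) ==
      similarity_features(toy_encode, b, a))
```

Although `toy_encode` gives different outputs for the two orderings, their sum is the same either way, which is the property the swapped second pass is meant to provide.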

In the multiple choice question answering task, the context and question are concatenated with each candidate answer, with a delimiter ($) placed before the answer, and each resulting sequence is passed through the transformer and linear layer independently. Finally, the linear outputs for all candidate answers are passed through a softmax to get a normalized probability distribution over the possible choices.
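The input transformations described in this section can be sketched as small list-building helpers. The token spellings `<s>` and `<e>`, the function names, and the example scores are assumptions for illustration. The multiple choice format follows the GPT paper's layout of context, question, delimiter, then answer.

```python
import math

START, DELIM, END = "<s>", "$", "<e>"  # placeholder token spellings

def classification_input(text):
    """Start token, text, end token."""
    return [START] + text + [END]

def entailment_input(premise, hypothesis):
    """Premise and hypothesis separated by the delimiter."""
    return [START] + premise + [DELIM] + hypothesis + [END]

def multiple_choice_inputs(context, question, answers):
    """One sequence per candidate answer: context, question, $, answer."""
    return [[START] + context + question + [DELIM] + ans + [END]
            for ans in answers]

def choice_probabilities(scores):
    """Softmax over the per-answer scores from the linear layer."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

sequences = multiple_choice_inputs(
    ["it", "rained"], ["what", "fell", "?"], [["rain"], ["snow"]])
for seq in sequences:
    print(" ".join(seq))

# Made-up scores standing in for the linear layer's output per answer.
print(choice_probabilities([2.0, 0.5]))
```

Each sequence would be scored separately by the transformer and linear layer, and only the final softmax compares the candidate answers with one another.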

Performance of GPT

Natural Language Inference (NLI)

GPT outperformed state of the art models on all NLI datasets except Recognizing Textual Entailment (RTE).

GPT Results on NLI tasks. 5x indicates ensemble of 5 models. Evaluation metric is Accuracy.

Question Answering (QA)

GPT outperformed other state of the art models on all QA datasets.

GPT Results on QA Tasks. 9x is an ensemble of 9 models. Evaluation metric is Accuracy.

Similarity and Classification

GPT outperformed previous state of the art models on most semantic similarity and classification datasets, with the exceptions of the Stanford Sentiment Treebank (SST2) and the Microsoft Research Paraphrase Corpus (MRPC).

GPT Results on Similarity and Classification. mc: Matthews Correlation; acc: Accuracy; pc: Pearson Correlation


In this article I discussed how GPT achieved state of the art results on NLI, QA, similarity and classification tasks. This model paved the way for GPT-2 and the famous GPT-3, which are just larger versions of GPT with a few tweaks.


Radford et al. (2018), “Improving Language Understanding by Generative Pre-Training”