Nagesh Singh Chauhan
OpenAI GPT-3: Understanding the Architecture
The article provides an in-depth understanding of OpenAI's very famous GPT-3 language model.
Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that employs deep learning to produce human-like text. It is the third-generation language prediction model in the GPT-n series created by OpenAI, a San Francisco-based artificial intelligence research laboratory. GPT-3's full version has a capacity of 175 billion machine learning parameters. Introduced in May 2020 and in beta testing as of July 2020, GPT-3 is part of a trend in natural language processing (NLP) systems toward pre-trained language representations. Before the release of GPT-3, the largest language model was Microsoft's Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters, less than 10 percent of GPT-3's.
“I am open to the idea that a worm with 302 neurons is conscious, so I am open to the idea that GPT-3 with 175 billion parameters is conscious too.” — David Chalmers
The quality of the text generated by GPT-3 is so high that it is difficult to distinguish from that written by a human, which has both benefits and risks. Thirty-one OpenAI researchers and engineers presented the original May 28, 2020 paper introducing GPT-3. In their paper, they warned of GPT-3's potential dangers and called for research to mitigate risk. David Chalmers, an Australian philosopher, described GPT-3 as "one of the most interesting and important AI systems ever produced."
Microsoft announced on September 22, 2020, that it had licensed "exclusive" use of GPT-3; others can still use the public API to receive output, but only Microsoft has control of the source code.
GPT-3 is a very large language model. Given some input text, it can probabilistically determine which tokens from a known vocabulary will come next. Before we go ahead and see what makes GPT-3 so special, let's first understand what a language model is.
What are Language Models?
Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. So simply put, a Language Model predicts the next word(s) in a sequence.
Language models — among other things — can suggest the next word we type. Source
Language models have many applications like:
Part of Speech (PoS) Tagging
News Article Generation
Question Answering, etc.
How does language modeling work?
Language models determine the probability of the next word by examining the text in the data. These models analyze the data by feeding it through algorithms.
The algorithms are responsible for creating rules for the context of natural language. The models learn the features and characteristics of a language and use that knowledge to predict new phrases and the next words in sentences.
For training a language model, a number of probabilistic approaches are used. These approaches vary on the basis of the purpose for which a language model is created. The amount of text data to be analyzed and the math applied for analysis make a difference in the approach followed for creating and training a language model.
Consider an arbitrary language L; here, English will be used for simplicity. A language model assigns probabilities to sequences of symbols such that the more likely a sequence (w1, w2, ..., wn) is to occur in that language, the higher its probability. A symbol can be a character, a word, or a sub-word (e.g. the word ‘going’ can be divided into two sub-words: ‘go’ and ‘ing’). Most language models estimate this probability as a product of each symbol's probability given its preceding symbols:
The probability of a sentence can be defined as the product of the probability of each symbol given the previous symbols. Credits
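To make the chain-rule decomposition concrete, here is a minimal sketch with a toy bigram model, where each word's probability depends only on the previous word. The probabilities in the table are invented purely for illustration:

```python
import math

# Toy bigram model: P(next_word | previous_word).
# All probabilities below are made up for illustration.
bigram_probs = {
    ("<s>", "the"): 0.4,   # "<s>" marks the start of the sentence
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3,
}

def sentence_log_prob(words):
    """Chain rule: log P(w1..wn) = sum_i log P(w_i | w_{i-1})."""
    log_p = 0.0
    prev = "<s>"
    for w in words:
        # Unseen bigrams get a tiny probability instead of zero.
        log_p += math.log(bigram_probs.get((prev, w), 1e-10))
        prev = w
    return log_p

p = math.exp(sentence_log_prob(["the", "cat", "sat"]))
print(p)  # 0.4 * 0.2 * 0.3 = 0.024
```

Real language models like GPT-3 do the same thing, except each conditional probability comes from a neural network conditioned on the entire preceding context, not just one word.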
OpenAI GPT-3 Architecture
GPT-3 is not a single model but a family of models. Each model in the family has a different number of trainable parameters. The following table shows each model, its architecture, and its corresponding parameters:
Sizes, architectures, and learning hyper-parameters of the GPT-3 models
In fact, the OpenAI GPT-3 family of models is based on the same transformer architecture as the GPT-2 model, including the modified initialization, pre-normalization, and reversible tokenization, with the exception that it uses alternating dense and sparse attention patterns.
The largest version, GPT-3 175B or simply “GPT-3,” has 175 billion parameters, 96 attention layers, and a batch size of 3.2M tokens.
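To get a feel for that scale, a quick back-of-envelope calculation shows how much memory the weights alone would occupy (assuming 2 bytes per parameter, i.e. half-precision floats, which is an assumption here rather than a published detail):

```python
# Rough size of GPT-3 175B's weights.
params = 175e9            # 175 billion parameters
bytes_per_param = 2       # assumption: fp16, 2 bytes each
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB just to store the weights")  # ~350 GB
```

At roughly 350 GB, the weights alone would not fit on any single GPU available at the time, which is one reason the model cannot simply be downloaded and run locally.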
Shown in the figure above is the original transformer architecture. As mentioned before, OpenAI GPT-3 is based on a similar architecture, only much larger. While language models like BERT use the encoder half to generate embeddings from raw text, which can be used in other machine learning applications, the GPT family uses the decoder half, so they take in embeddings and produce text.
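The decoder-only idea can be sketched as a simple autoregressive loop: predict a distribution over the next token, pick one, append it to the context, and repeat. The lookup-table "model" below is a stand-in for the real transformer decoder stack; everything in it is invented for illustration:

```python
# Purely illustrative sketch of decoder-style autoregressive generation.
def next_token_probs(context):
    # A real model would run the full decoder stack over the whole
    # context; this toy version only looks at the last token.
    table = {
        "the": {"cat": 0.7, "dog": 0.3},
        "cat": {"sat": 0.9, "ran": 0.1},
        "sat": {"<eos>": 1.0},   # end-of-sequence
    }
    return table.get(context[-1], {"<eos>": 1.0})

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy decoding
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

GPT-3 follows the same loop, but samples from a distribution computed by 96 layers of masked self-attention instead of a lookup table.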
Why is GPT-3 so powerful?
The first thing that stands out about GPT-3 is its sheer size: it has 10x more trainable parameters than any previous model.
Evolution of language models. Credits
The model is built using the standard concepts of transformers, attention, etc., and trained on the typical Common Crawl, Wikipedia, and books corpora, plus some additional data sources. A lot of things (pre-training, model, data) are similar to GPT-2, but everything (model size, data size, training time) is just a lot bigger. In fact, its humongous size is what drives most of the benefits of the model.
The following graph shows the gain in accuracy on various zero-/one-/few-shot tasks as a function of the number of model parameters; clearly, the major gains come from the scaled-up size.
Accuracy for various Zero / One / Few shot tasks. Credits
Most of the numbers involved are huge, for example 96 attention layers, a batch size of 3.2M tokens, and 175B parameters, unlike anything in the past. The model is ~10x larger in terms of the number of parameters than the next closest thing (Microsoft's Turing NLG with 17B parameters).
There is no need to do gradient/parameter updates (fine-tuning) to use the GPT-3 model for various tasks. You can just interact with the model using natural language and/or provide a few examples of the task you are trying to do, and the model will do it!
Three settings in which GPT-3 can perform the task of translating from English to French.
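The three settings amount to building different prompts: zero-shot gives only a task description, one-shot adds a single demonstration, and few-shot adds several. Here is a minimal sketch in Python; the example pairs follow the English-to-French illustration from the GPT-3 paper:

```python
# Building zero-, one-, and few-shot prompts for translation.
task = "Translate English to French:"

zero_shot = f"{task}\ncheese =>"

one_shot = f"{task}\nsea otter => loutre de mer\ncheese =>"

examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]
demos = "\n".join(f"{en} => {fr}" for en, fr in examples)
few_shot = f"{task}\n{demos}\ncheese =>"

print(few_shot)
```

In each case the model simply continues the text after `cheese =>`; no weights are updated, which is what distinguishes in-context learning from fine-tuning.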
GPT-3 has made headlines since last summer because it can perform a wide variety of natural language tasks and produces human-like text. The tasks that GPT-3 can perform include, but are not limited to:
Text classification (e.g. sentiment analysis)
Based on the tasks that GPT-3 can perform, we can think of it as a model that can perform reading comprehension and writing tasks at a near-human level, except that it has seen more text than any human will ever read in their lifetime. This is exactly why GPT-3 is so powerful. Entire startups have been built on GPT-3, because it can be thought of as a general-purpose Swiss Army knife for solving a wide variety of problems in natural language processing.
Use cases of GPT-3
Writing and translation
We have looked at how GPT-3 manages to generate impressive text. It’s only natural to have a use case in writing. Thanks to its ability to produce believable text, the model can be used to write most if not all forms of literature. Check out the examples of GPT-3 creative writing.
The model can write fiction, crack jokes, write poems, and generate conversation transcripts, among much else. Provided with the correct prompt, it can write convincing and captivating articles. It is also capable of generating all sorts of documents, from business memos to legal documents. Besides writing, as mentioned before, the model is pretty good at language translation tasks as well.
The model has the ability to generate code in different languages. In most examples I’ve seen, it just takes in an English description of the requirements and generates the code. Here’s an example.
GPT-3 can generate website mock-ups as well. I’ve seen examples like this one where it takes a description of the desired website and a URL to create mock-ups. This is particularly useful to UI/UX designers.
Here is an example of GPT-3 going a step further and explaining code in English.
Building machine learning models/code
There are examples that show GPT-3 generating code for machine learning models. In one example, it only needs a description of the dataset and required output. Check it out here.
For more cool examples of the uses of GPT-3, check out GPT Crush and GPT-3 Examples.
How Can We Get Our Hands on the Model?
You can’t just download the model or train it on your own even if you have the infrastructure. OpenAI has built an API that is available through a waiting list. You can visit their site and join the waiting list. In fact, you can go to the demo section of https://beta.openai.com and try out some demos yourself to get a fair idea of how some of the use-cases work.
Demo section at https://beta.openai.com
Limitations of OpenAI GPT-3
The creators of GPT-3 themselves accept that the model has its weaknesses and makes silly mistakes. In particular, its text synthesis suffers from problems such as repetition, contradiction, and loss of coherence over long passages.
Consider some of the limitations of GPT-3 listed below:
GPT-3 lacks long-term memory — the model does not learn anything from long-term interactions like humans.
Lack of interpretability — this is a problem that affects extremely large and complex models in general. GPT-3 is so large that it is difficult to interpret or explain the output that it produces.
Limited input size — transformers have a fixed maximum input size, which means that the prompts GPT-3 can handle are limited to a context window of 2,048 tokens.
Slow inference time — because GPT-3 is so large, it takes more time for the model to produce predictions.
GPT-3 suffers from bias — all models are only as good as the data that was used to train them and GPT-3 is no exception. This paper, for example, demonstrates that GPT-3 and other large language models contain anti-Muslim bias.
While GPT-3 is powerful, it still has limitations that make it far from being a perfect language model or an example of artificial general intelligence (AGI).
Future of GPT-3
OpenAI and others are working on even more powerful and large models. There are a number of open-source efforts in play to provide a free and non-licensed model as a counterweight to Microsoft's exclusive ownership. OpenAI is planning larger and more domain-specific versions of its models trained on different and more diverse kinds of texts. Others are looking at different use cases and applications of the GPT-3 model. However, Microsoft's exclusive license poses challenges for those looking to embed the capabilities in their applications.
GPT-3 has received a lot of attention since last summer because it is by far the largest and arguably most powerful language model created as of the time of writing. However, GPT-3 still suffers from several limitations that make it far from being a perfect language model or an example of AGI. If you would like to use GPT-3 for research or commercial purposes, you can apply to use OpenAI’s API, which is currently in private beta. Otherwise, you can always work directly with GPT-2, which is publicly available and open-source thanks to HuggingFace’s transformers library.