Nagesh Singh Chauhan
- Mar 12, 2022

Google BERT: Understanding the Architecture

Google uses the BERT algorithm to better understand users’ search intentions, which helps it provide more relevant results.


BERT and RankBrain: History

In 2015, Google announced an update that transformed the search universe: RankBrain. It was the first time the algorithm used artificial intelligence to understand content and searches.

Like BERT, RankBrain also uses machine learning, but it does not do Natural Language Processing (NLP). It concentrates on analyzing queries and grouping words and phrases that are semantically similar, but it cannot understand human language on its own.

So, when a new query is made on Google, RankBrain analyzes past searches and identifies which words and phrases best match that search, even if they don’t match exactly or have never been searched.

As they receive user-interaction signals, the bots learn more about the relationships between words and improve ranking.

Therefore, this was Google’s first step in understanding human language. Even today, it is one of the methods used by the algorithm to understand search intentions and page contents in order to present better results to users.

So, BERT did not replace RankBrain — it just brought another method of understanding human language. Depending on the search, Google’s algorithm can use either method (or even combine the two) to deliver the best response to the user.

Keep in mind that Google’s algorithm is formed by a vast complexity of rules and operations. RankBrain and BERT play a significant role, but they are only parts of this robust search system.

Introduction

In this article, we will discuss BERT (Bidirectional Encoder Representations from Transformers), which was proposed by Google AI in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. It is one of the groundbreaking models that achieved state-of-the-art results on many downstream tasks, and probably one of the most exciting developments in NLP in recent years.

In 2019, Google announced that it is using BERT in Search, calling it the “biggest leap forward” in understanding searches in the past five years. That is a huge testament coming from Google. About Search! That’s just how significant BERT is.

It is worth mentioning that in 2021, Prabhakar Raghavan, Senior Vice President at Google, announced a new AI model called Multitask Unified Model (MUM) at the Google I/O event. MUM is built on the T5 text-to-text framework, which, like BERT, is based on the Transformer architecture. Google describes MUM as 1,000 times more powerful than BERT and presents it as the future of AI behind Google Search.

What is BERT?

BERT is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google.[1][2] In 2019, Google announced that it had begun leveraging BERT in its search engine, and by late 2020 it was using BERT in almost every English-language query.

BERT is at its core a transformer language model with a variable number of encoder layers and self-attention heads. The architecture is "almost identical" to the original transformer implementation in Vaswani et al. (2017).

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism of the Transformer is necessary. The detailed workings of the Transformer are described in the Google paper “Attention Is All You Need” (Vaswani et al., 2017).

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

The bidirectionality of a model is important for truly understanding the meaning of a language. Let’s see an example to illustrate this. There are two sentences in this example and both of them involve the word “bank”:

If we try to predict the nature of the word “bank” by only taking either the left or the right context, then we will be making an error in at least one of the two given examples.

One way to deal with this is to consider both the left and the right context before making a prediction. That’s exactly what BERT does! We will see later in the article how this is achieved.

The diagram below is a high-level description of the Transformer encoder and decoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

The above figure is a high-level diagram of a Transformer.

A stack of encoders is essentially BERT, while a stack of decoders is essentially GPT (Generative Pre-trained Transformer). The key difference is that the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which every token can only attend to the context to its left.
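To make the difference between the two attention patterns concrete, here is a minimal sketch that builds the "who may attend to whom" matrix for both cases. The function name `attention_mask` and the toy sequence length are illustrative choices, not part of any library.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; entry [i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional = attention_mask(4, causal=False)  # encoder-style attention (BERT)
left_to_right = attention_mask(4, causal=True)   # decoder-style attention (GPT)

for name, mask in [("bidirectional", bidirectional), ("left-to-right", left_to_right)]:
    print(name)
    for row in mask:
        print(" ", row)
```

In the bidirectional case every row is all ones, so each token sees the whole sequence. In the left-to-right case the matrix is lower triangular, so token i only sees tokens 0 through i.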

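When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two pre-training strategies, Masked Language Modeling and Next Sentence Prediction, both described later in this article.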

We currently have two variants available for BERT:

BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters


Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer blocks): twelve for the Base version and twenty-four for the Large version. They also have larger hidden sizes (768 and 1024 respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).
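As a rough sanity check on the parameter counts quoted above, the sketch below adds up the weights of a BERT-style encoder. Several values are assumptions that do not appear in this article: a vocabulary of 30,522 tokens, a maximum of 512 positions, a feed-forward width of four times the hidden size, and a pooling layer on top of the [CLS] output. These match the commonly released English uncased checkpoints, but treat the result as an estimate.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (assumed layout, see text)."""
    ffn = 4 * hidden         # assumed feed-forward width
    layer_norm = 2 * hidden  # one scale and one bias per hidden unit
    # token + position + segment embedding tables, followed by a layer norm
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # query, key, value and output projections (weights + biases), then a layer norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # two dense layers (hidden -> ffn -> hidden), then a layer norm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # dense layer applied to the [CLS] output
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(f"BERT Base : {bert_param_count(12, 768):,}")
print(f"BERT Large: {bert_param_count(24, 1024):,}")
```

Under these assumptions the totals come out at roughly 109 million and 335 million, close to the rounded 110 and 340 million figures. Note that the number of attention heads does not appear in the formula: the heads split the hidden size between them, so changing their number does not change the parameter count.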

How does BERT work?

There are two steps in the BERT framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.

BERT pre-training uses two training strategies:

1. Masked Language Modeling (MLM)

BERT is designed as a deeply bi-directional model. The network effectively captures information from both the right and left context of a token from the first layer itself and all the way through to the last layer.

Predicting the word in a sequence

Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected for prediction, and most of them are replaced with a [MASK] token (in the paper, 80% of the selected tokens become [MASK], 10% are swapped for a random token, and 10% are left unchanged). The model then attempts to predict the original value of the selected tokens, based on the context provided by the other tokens in the sequence. In technical terms, the prediction of the output words requires:

Adding a classification layer on top of the encoder output.
Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
Calculating the probability of each word in the vocabulary with softmax.
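The three steps above can be sketched in plain Python for a single token position. This is a toy illustration, not BERT's actual implementation: the sizes are tiny, the weights are random, and `tanh` stands in for the activation and normalization used in the real model. All names here are hypothetical.

```python
import math
import random

random.seed(0)
VOCAB, HIDDEN = 6, 4  # toy sizes, purely illustrative

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

embedding_matrix = rand_matrix(VOCAB, HIDDEN)  # one row per vocabulary entry
dense_weights = rand_matrix(HIDDEN, HIDDEN)    # the added classification layer

def mlm_probabilities(encoder_output):
    # Step 1: classification layer on top of the encoder output
    transformed = [math.tanh(v) for v in matvec(dense_weights, encoder_output)]
    # Step 2: multiply by the embedding matrix to get one score per vocabulary entry
    logits = matvec(embedding_matrix, transformed)
    # Step 3: softmax turns the scores into probabilities
    return softmax(logits)

probs = mlm_probabilities([0.5, -0.2, 0.1, 0.9])
print([round(p, 3) for p in probs])
```

The result is one probability per vocabulary entry, and the probabilities sum to 1. Reusing the embedding matrix in step 2 means the output layer shares its weights with the input embeddings.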

Masked Language Modeling (MLM).

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges more slowly than directional models, a characteristic that is offset by its increased context-awareness.
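The masking step itself can be sketched as follows. This is a simplified version that always substitutes [MASK] and works on whitespace-separated words rather than BERT's sub-word tokens. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Replace about mask_rate of the tokens with [MASK].

    Returns the masked sequence and a dict {position: original token},
    which is what the model is later asked to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets

tokens = "the child came home from school after a long day of classes today".split()
masked, targets = mask_tokens(tokens, seed=1)
print(" ".join(masked))
print(targets)
```

With 13 input words and a 15% rate, two positions are masked. Only the positions stored in `targets` would contribute to the loss.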

2. Next Sentence Prediction (NSP)

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
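A minimal sketch of how such training pairs could be assembled is shown below. The function name, the example sentences, and the way the random sentence is drawn are all illustrative assumptions. A real pipeline would sample the random sentence from a different document in a large corpus.

```python
import random

def make_nsp_examples(document, corpus_sentences, seed=None):
    """Build (sentence_a, sentence_b, is_next) triples.

    About half the pairs keep the true next sentence; the rest get a
    randomly chosen sentence from the corpus instead.
    """
    rng = random.Random(seed)
    examples = []
    for first, actual_next in zip(document, document[1:]):
        if rng.random() < 0.5:
            examples.append((first, actual_next, True))
        else:
            # Simplification: pick any corpus sentence other than the pair itself
            candidates = [s for s in corpus_sentences if s not in (first, actual_next)]
            examples.append((first, rng.choice(candidates), False))
    return examples

document = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Then he walked home.",
]
corpus_sentences = document + [
    "Penguins are flightless birds.",
    "The stock market fell today.",
]
examples = make_nsp_examples(document, corpus_sentences, seed=0)
for sentence_a, sentence_b, is_next in examples:
    print(is_next, "|", sentence_a, "|", sentence_b)
```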

Next Sentence Prediction (NSP).

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
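The input preparation described above can be sketched as follows. The sentences are already split into words here for simplicity, whereas BERT itself works on sub-word tokens. The function name `pack_sentence_pair` is hypothetical.

```python
def pack_sentence_pair(sentence_a, sentence_b):
    """Return tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP]
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    # Position ids simply count the tokens from left to right
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = pack_sentence_pair(
    ["my", "dog", "is", "cute"], ["he", "likes", "playing"]
)
for row in zip(position_ids, segment_ids, tokens):
    print(row)
```

Each of the three lists is later turned into vectors by its own embedding table, and the three vectors for a token are added together.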


To predict whether the second sentence is indeed connected to the first, the following steps are performed:

The entire input sequence goes through the Transformer model.
The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
Calculating the probability of IsNextSequence with softmax.
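The last two steps can be sketched with a toy classification layer. The weights, the hidden size of 3, and the convention that index 0 means "is next" are assumptions made for illustration only.

```python
import math

def nsp_probabilities(cls_output, weights, bias):
    """Map the [CLS] output vector to [P(is next), P(not next)]."""
    # Linear layer: one row of weights and one bias per class
    logits = [
        sum(w * x for w, x in zip(row, cls_output)) + b
        for row, b in zip(weights, bias)
    ]
    # Softmax over the two classes
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

cls_output = [0.2, -0.4, 0.7]  # toy [CLS] vector with hidden size 3
weights = [[0.5, -0.1, 0.3], [-0.2, 0.4, 0.1]]
bias = [0.0, 0.1]

is_next, not_next = nsp_probabilities(cls_output, weights, bias)
print(f"P(is next) = {is_next:.3f}, P(not next) = {not_next:.3f}")
```

In a trained model the weights and biases are learned, and the higher of the two probabilities is taken as the prediction.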

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.
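Written out, the combined objective is simply the sum of the two losses. The notation below is an assumption introduced here for illustration: $M$ is the set of masked positions, $w_i$ the original token at position $i$, $\tilde{x}$ the masked input sequence, and $y$ the is-next label.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}},
\qquad
\mathcal{L}_{\text{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log p(w_i \mid \tilde{x}),
\qquad
\mathcal{L}_{\text{NSP}} = -\log p(y \mid \tilde{x})
```

Both terms are cross-entropy losses, so minimizing their sum pushes the model to get the masked words and the sentence relationship right at the same time.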

Combining the Masked LM and Next Sentence Prediction tasks, the pre-training architecture in BERT is shown below.

Pre-training architecture in BERT.

How is training done in BERT?

Adding the three vectors (token, segment, and positional embeddings) gives the embedding vector that is used as input to BERT. The positional embeddings are required for ordering: all token vectors are fed to BERT simultaneously, so the model has no other way of knowing the order of the words. The segment embeddings tell it which sentence each token belongs to.
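Here is a minimal sketch of that sum, using small random lookup tables. The table sizes, the hidden size of 4, and the example ids are hypothetical, and the normalization and dropout that a real model applies afterwards are left out.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size

def table(rows):
    """Random lookup table with one HIDDEN-sized vector per row."""
    return [[round(random.uniform(-1, 1), 2) for _ in range(HIDDEN)] for _ in range(rows)]

token_table = table(10)    # hypothetical vocabulary of 10 tokens
segment_table = table(2)   # sentence A or sentence B
position_table = table(8)  # hypothetical maximum sequence length of 8

def embed(token_ids, segment_ids):
    """Sum token, segment and position vectors for every input position."""
    return [
        [t + s + p for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]

inputs = embed(token_ids=[1, 5, 2, 7, 2], segment_ids=[0, 0, 0, 1, 1])
for vector in inputs:
    print([round(v, 2) for v in vector])
```

Token id 2 appears twice in this example, at positions 2 and 4. Because the position and segment vectors differ, the two occurrences will generally end up with different input vectors, which is how the model can tell them apart.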

The output is C (used for the binary next-sentence prediction) and a word vector Ti for each input token. For training, we need to minimize the loss.

We take each word vector Ti and pass it to a fully connected output layer whose number of neurons equals the number of tokens in the vocabulary. We then apply the softmax activation function, which converts the word vector into a probability distribution over the vocabulary.

The target distribution is a one-hot encoded vector for the actual word. We then compare the two distributions and train the network using cross-entropy loss.

Note that the network outputs a prediction for every position, including the tokens that were never masked. The loss function only considers the predictions for the masked words and ignores all the other outputs of the network. This focuses training on predicting the masked words correctly, which in turn increases context awareness.
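The sketch below shows what "only the masked positions count" means in practice. With a one-hot target, the cross-entropy at a position reduces to the negative log of the probability assigned to the true word. The function name and the toy three-word vocabulary are illustrative assumptions.

```python
import math

def masked_lm_loss(predicted_probs, target_ids, masked_positions):
    """Mean cross-entropy over the masked positions only; all other positions are ignored."""
    losses = [
        -math.log(predicted_probs[pos][target_ids[pos]])
        for pos in masked_positions
    ]
    return sum(losses) / len(losses)

# One probability distribution over a 3-word vocabulary per input position
predicted_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
target_ids = [0, 1, 2]  # index of the actual word at each position

loss = masked_lm_loss(predicted_probs, target_ids, masked_positions=[1])
print(round(loss, 4))
```

Only position 1 is masked here, so the loss depends solely on the 0.8 assigned to the correct word at that position. Changing the predictions at positions 0 or 2 would leave it unchanged.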

How to use BERT (Fine-tuning)

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve single text or text pairs, by swapping out the appropriate inputs and outputs.

For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention. BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

BERT can be used for a wide variety of language tasks, while only adding a small layer to the core model:

Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
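As an illustration of how small these task-specific layers are, here is a sketch of the question-answering case: two learned vectors score every token as a possible start or end of the answer. The toy token outputs, the hand-picked vectors, and the function names are hypothetical. In a real model the vectors are learned during fine-tuning.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_span(token_outputs, start_vector, end_vector):
    """Pick the (start, end) token indices with the highest combined probability."""
    start_probs = softmax([dot(t, start_vector) for t in token_outputs])
    end_probs = softmax([dot(t, end_vector) for t in token_outputs])
    best = None
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(end_probs)):  # the end may not come before the start
            score = p_start * end_probs[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]

# Toy encoder outputs for five tokens, with hidden size 2
token_outputs = [[0.1, 0.0], [2.0, 0.1], [0.3, 0.2], [0.1, 2.5], [0.0, 0.1]]
start_vector = [1.0, 0.0]  # stands in for the learned "start" vector
end_vector = [0.0, 1.0]    # stands in for the learned "end" vector

start, end = predict_span(token_outputs, start_vector, end_vector)
print(start, end)
```

Here the answer span runs from token 1 to token 3. Classification and NER follow the same pattern with a different head: a single layer on the [CLS] output for classification, or the same layer applied to every token output for NER.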

Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different downstream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).