Breaking Boundaries with Generative AI: A Closer Look at Large Language Models
How Large Language Models are Transforming the Creative Landscape.
Language holds great significance as it serves as our means of acquiring knowledge about the world, whether through news, web searches, or references like Wikipedia. It is also the medium through which we shape the world, establishing agreements, enacting laws, and conveying messages. Language facilitates connections and communication among individuals, groups, and companies.
Despite the rapid advancements in software, computers still face limitations when it comes to dealing with language. While software excels at precise text searches, it often falls short in handling the more complex aspects of language that humans employ in their daily lives.
This calls for the development of more intelligent tools that can better comprehend language.
[Figure: LLM adaptation.]
In recent years, there has been a significant breakthrough in the field of natural language processing (NLP) and artificial intelligence, marked by the emergence of Large Language Models (LLMs) like OpenAI's GPT-3 (Generative Pre-trained Transformer 3). These models have captured the attention of researchers, developers, and the general public alike.
In this blog post, we will delve into the world of LLMs, exploring their capabilities and examining their impact across various fields.
What are Large Language Models (LLMs)?
Large language models (LLMs) predominantly fall under the category of transformer neural networks, which are deep learning architectures. A transformer model is a type of neural network that acquires context and meaning by examining relationships within sequential data, such as the words present in this sentence.
A transformer consists of multiple transformer blocks, also referred to as layers. These layers encompass self-attention layers, feed-forward layers, and normalization layers, all collaborating to decode input and predict sequences of output during inference. By stacking these layers, transformers can become deeper, resulting in more potent language models. The concept of transformers was initially introduced by Google in the 2017 paper titled "Attention Is All You Need."
LLMs undergo training on vast quantities of textual data, enabling them to comprehend the intricate patterns, grammar, and semantics of human language. Leveraging their deep neural network architecture, LLMs can generate text that closely resembles human writing, comprehend context, and accomplish a diverse range of language-related tasks.
Transformers excel in large language models due to two significant innovations: positional encodings and self-attention.
Positional encoding records the position of each element in a sequence. Because every word carries its own position, the words no longer have to be fed into the neural network one at a time; the whole sequence can be processed in parallel without losing word order.
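To make the idea concrete, here is a minimal sketch of one possible positional encoding. It assumes the sinusoidal scheme described in "Attention Is All You Need"; the post does not say which scheme any particular LLM uses, and the function names are illustrative only.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one wavelength; even indices
            # use sine and odd indices use cosine.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings, encoding):
    """Add each position vector to the token embedding at that position."""
    return [[e + p for e, p in zip(emb, enc)]
            for emb, enc in zip(embeddings, encoding)]

# Assumed toy sizes: 4 positions, 6-dimensional embeddings.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
```

Every position gets a distinct vector, so after `add_positions` the same word at two different positions produces two different inputs to the network.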
Self-attention assigns weights to different parts of the input data during processing. These weights indicate the relevance of each input in relation to the rest. As a result, models can allocate attention selectively, focusing on the input elements that truly matter. The neural network learns over time which parts of the input demand attention as it sifts through extensive data.
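The weighting step can be sketched in a few lines of plain Python. This is a simplified illustration of scaled dot-product self-attention, assuming that queries, keys, and values are the raw input vectors; a real transformer first multiplies the inputs by learned projection matrices, which are omitted here.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every input element to this query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "token" vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` shows how much one token attends to every token in the sequence, including itself.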
By combining these techniques, transformers can effectively analyze the intricate connections and contexts among distinct elements, even over long distances and in a non-sequential manner.
How Do Large Language Models Work?
During the training phase, large language models are exposed to extensive amounts of text data. This training data enables the model to learn the statistical patterns, grammar rules, and semantic relationships present in human language. The model's parameters are adjusted through an optimization process to minimize the difference between the model's predicted output and the desired output.
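As a rough illustration of that optimization process, the sketch below trains the smallest possible "model": a list of three logits, one per token in an assumed three-word vocabulary. The loss is the cross-entropy between the predicted distribution and the desired next token, and plain gradient descent adjusts the parameters. Real LLMs use the same kind of objective but with billions of parameters and more sophisticated optimizers.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the desired token."""
    return -math.log(softmax(logits)[target])

def gradient_step(logits, target, lr):
    """One gradient-descent update; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [l - lr * g for l, g in zip(logits, grads)]

logits = [0.0, 0.0, 0.0]   # untrained: every token equally likely
target = 2                 # index of the desired next token (assumed)
losses = []
for _ in range(20):
    losses.append(cross_entropy(logits, target))
    logits = gradient_step(logits, target, lr=0.5)
```

The recorded `losses` shrink at every step as the model shifts probability toward the desired token.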
Once trained, these models can be utilized for a wide range of language-related tasks. They can generate coherent and contextually relevant text, summarize information, translate between languages, perform sentiment analysis, answer questions, and more. To enhance the performance of large language models for specific applications or domains, a process called fine-tuning is often employed. Fine-tuning involves further training the model on domain-specific or task-specific data, allowing it to specialize in a particular area and improve its performance in that specific context.
[Figure: LLM training and fine-tuning.]
In summary, large language models utilize deep learning techniques and self-attention mechanisms to understand and generate text. They are trained on extensive text data to learn language patterns, and their parameters are optimized to minimize prediction errors. These models can then be fine-tuned for specific tasks, making them powerful tools for natural language processing and text-related applications.
There is also an interesting phenomenon called in-context learning, which is being investigated by researchers from MIT, Stanford, and Google Research. It refers to a situation where a large language model can accomplish a task after observing only a few examples, even if it was not trained specifically for that task.
For instance, if the model is provided with several sentences conveying positive or negative sentiments, it can accurately discern the sentiment of a new sentence. Typically, a machine learning model like GPT-3 would require retraining with new data to perform a different task. However, in the case of in-context learning, the model's parameters remain unchanged, giving the impression that it has acquired new knowledge without undergoing additional training.
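A sentiment prompt of this kind could be assembled as follows. This is only a sketch of the prompt format: the "Review:" and "Sentiment:" labels and the example sentences are made up for illustration, and the resulting string would still need to be sent to a language model of your choice.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples followed by an unlabeled query.

    The model is expected to continue the text after the final
    "Sentiment:" line; no model parameters are changed.
    """
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")  # blank line between examples
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

# Hypothetical in-context examples.
examples = [
    ("The food was wonderful.", "positive"),
    ("The service was slow and rude.", "negative"),
]
prompt = build_few_shot_prompt(examples, "I would happily come back.")
print(prompt)
```

Passing an empty list of examples yields a zero-shot prompt, and a single example yields a one-shot prompt.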
[Figure: Two examples of in-context learning, where a language model (LM) is given a list of training examples (black) and a test input (green) and asked to make a prediction (orange) by predicting the next tokens/words to fill in the blank.]
"By gaining a deeper understanding of in-context learning, researchers can potentially enable models to tackle novel tasks without the need for resource-intensive retraining," explains Ekin Akyürek, the lead author of the paper that explores this recent phenomenon.
LLMs are few-shot learners
The GPT-3 paper, titled "Language Models Are Few-Shot Learners," demonstrated that large language models (LLMs) become better few-shot learners as their parameter count and training dataset grow. This is significant because few-shot learning enables models to perform well on various tasks without requiring fine-tuning on task-specific data.
The trend of improving few-shot learning abilities appears to continue with ChatGPT, as evidenced by a paper that evaluates its performance on 20 NLP datasets. While it proves to be competitive compared to fine-tuning, there are still certain tasks, such as named entity recognition, summarization, and sentiment analysis, where it may not excel as much.
[Figure: Examples of zero-shot, one-shot, and few-shot learning through prompting.]
Nevertheless, we can anticipate that this gap will progressively diminish, and LLMs will eventually achieve remarkable accuracy without the need for fine-tuning. It is plausible that GPT-4 has already made significant progress in closing this gap, although there is currently no official and comprehensive analysis of its performance on NLP datasets.
Alignment and reinforcement learning from human feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a technique in machine learning that trains AI models based on feedback from humans. A "reward model" is created from this feedback and used to optimize the model's behavior through reinforcement learning. The reward model predicts whether a given output is good or bad. RLHF can improve the robustness and exploration of AI agents, particularly when the reward function is sparse or noisy.
To collect human feedback, humans are asked to rank instances of the AI's behavior. These rankings are then used to score outputs using methods like the Elo rating system.
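As a sketch of how pairwise rankings can become scores, the snippet below applies the standard Elo update to three outputs. The starting rating of 1000, the K-factor of 32, and the comparison data are all assumptions chosen for illustration; the post does not say what values an actual RLHF pipeline uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_winner, rating_loser, k=32.0):
    """Return the new (winner, loser) ratings after one comparison."""
    delta = k * (1.0 - expected_score(rating_winner, rating_loser))
    return rating_winner + delta, rating_loser - delta

# Three candidate outputs start with the same rating.
ratings = {"A": 1000.0, "B": 1000.0, "C": 1000.0}

# Hypothetical human judgments as (preferred, rejected) pairs.
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]
for winner, loser in comparisons:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser])
```

After these comparisons, output A has the highest rating and C the lowest, which matches the order implied by the human preferences.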
[Figure: Overview of RLHF from OpenAI.]
In simpler terms, RLHF trains AI models by learning from human responses to their performance. If the AI makes mistakes, human feedback helps correct errors and suggest better responses. This iterative process helps the model improve over time. RLHF is useful in tasks where finding an algorithmic solution is challenging, but humans can easily evaluate the quality of the AI's output. For example, when generating compelling stories, humans can rate different AI-generated stories, and the AI can use their ratings to improve its story-generation skills.
Technical detail note: The above diagram makes it look like both models generate different responses for the same prompt, but what really happens is that the RL policy generates text, and that text is fed into the initial model to produce its relative probabilities for the KL penalty.
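The KL penalty mentioned in this note can be sketched as follows. The code assumes we already have, for each generated token, the log-probability assigned by the RL policy and by the initial model; the coefficient `beta` and all numbers are hypothetical, and the sum of log-probability differences is a common sample-based estimate of the KL divergence rather than a detail confirmed by the post.

```python
def kl_penalized_reward(reward_score, policy_logprobs, initial_logprobs, beta=0.1):
    """Subtract a KL penalty from the reward model's score.

    policy_logprobs / initial_logprobs: log-probabilities that each model
    assigns to the tokens the policy actually generated.
    """
    # Sample-based estimate of KL(policy || initial) for this sequence.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, initial_logprobs))
    return reward_score - beta * kl_estimate, kl_estimate

# Made-up values for a three-token response.
policy_logprobs = [-0.5, -1.0, -0.2]
initial_logprobs = [-0.7, -1.5, -0.9]
reward, kl = kl_penalized_reward(2.0, policy_logprobs, initial_logprobs)
```

The further the policy drifts from the initial model on the text it generates, the larger the estimate and the more reward is subtracted, which discourages the policy from producing text the original model would find very unlikely.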
How is LLM performance evaluated?
The perplexity of a language model on a specific text corpus is the most commonly used measure to evaluate its performance.
Perplexity serves as a gauge of how surprised the model is by new data. A lower perplexity means the model assigns higher probability to the text it is evaluated on, in other words, it finds that text more predictable.
Perplexity measures the model's predictive capability on a dataset: the higher the likelihood the model assigns to the dataset, the lower the perplexity. Mathematically, perplexity is the exponential of the average negative log-likelihood per token:
$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(\text{token}_i \mid \text{context for token } i)\right)$$
Here N is the number of tokens in the text corpus, and "context for token i" depends on the specific type of LLM used. If the LLM is autoregressive, then "context for token i" is the segment of text appearing before token i. If the LLM is masked, then "context for token i" is the segment of text surrounding token i.
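The definition can be checked with a few lines of plain Python. This sketch assumes we already have the probability the model assigned to each of the N tokens given its context; the probabilities below are made up.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-likelihood per token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that gives every token probability 0.5 versus one that gives 0.1.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])
```

A model that assigns probability 1/k to every token has a perplexity of k (up to floating-point rounding), so the first model behaves as if it were choosing among 2 equally likely tokens at each step and the second among 10.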
Because language models may overfit their training data, models are usually evaluated by their perplexity on a test set of unseen data. This presents particular challenges for the evaluation of large language models. As they are trained on increasingly large corpora of text largely scraped from the web, it becomes increasingly likely that models' training data inadvertently includes portions of any given test set.
Now, let us compare the perplexity GPT-2 assigns to two sentences. We first load the tokenizer and the causal language-modeling head for the GPT-2 model from Hugging Face, then compute the perplexity of each sentence:
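```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pretrained GPT-2 with its language-modeling head, plus its tokenizer.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# First sentence. Passing the input ids as labels makes the model return
# the average cross-entropy loss per token; exp(loss) is the perplexity.
inputs = tokenizer("ABC is a startup based in New York City and Paris", return_tensors="pt")
with torch.no_grad():
    loss = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"]).loss
ppl = torch.exp(loss)
print(ppl.item())  # output reported in the original post: 29.48

# Second sentence.
inputs_wiki_text = tokenizer("Generative Pretrained Transformer is an opensource artificial intelligence created by OpenAI in February 2019", return_tensors="pt")
with torch.no_grad():
    loss = model(input_ids=inputs_wiki_text["input_ids"], labels=inputs_wiki_text["input_ids"]).loss
ppl = torch.exp(loss)
print(ppl.item())  # output reported in the original post: 211.81
```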
As you can see, the perplexity of the first sentence is much lower than that of the second. This is what we would expect if the first sentence closely resembles text the model saw during training. The second sentence is less predictable for the model, so GPT-2 is more perplexed by it.
Perplexity tells us how well a model predicts text, but it does not directly measure quality on a downstream task. Task-specific metrics such as BLEU (commonly used for translation) and ROUGE (commonly used for summarization) are used to measure that kind of performance.
What are the Challenges of Large Language Models?
Naturally, like any technology, large language models (LLMs) have their limitations.
Reliability and bias
One significant challenge is ensuring the accuracy and reliability of the content they generate. While LLMs can produce content that resembles a specific author or genre, they may also generate inaccurate or misleading information, especially in the case of news articles or other content requiring high precision. In addition, when training data isn't examined and labeled, language models have been shown to make racist or sexist comments.
To address this limitation, one approach is to leverage conversational AI, connecting the LLM to a trustworthy data source like a company's website. This enables the utilization of the LLM's generative capabilities to create valuable content for virtual agents, including training data and brand-aligned responses.
Megatron-Turing was developed with hundreds of NVIDIA DGX A100 multi-GPU servers, each using up to 6.5 kilowatts of power. On top of the energy needed to run this hardware, more is needed to cool it, so training such models consumes large amounts of electricity and leaves behind a large carbon footprint.
According to one study, the carbon emissions from training BERT (a language model from Google) on GPUs are roughly equivalent to those of a trans-American flight.
Developing large language models requires significant investment in the form of computer systems, human capital (engineers, researchers, scientists, etc.), and power. Because it is so resource intensive, developing large language models is feasible only for huge enterprises with vast resources. It is estimated that Megatron-Turing from NVIDIA and Microsoft has a total project cost of close to $100 million.
Each large language model has a fixed context window, so it can accept only a certain number of tokens as input. For instance, the original GPT-3 has a limit of 2,048 tokens (around 1,500 words), which means it cannot process a prompt and generate a response that together exceed that limit.
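One common way to cope with such a limit is to truncate the input before sending it to the model. The sketch below keeps only the most recent tokens and reserves part of the window for the model's reply. It uses whitespace splitting as a stand-in tokenizer, which is an assumption: real models use subword tokenizers, so actual token counts will differ. The function name and the reserve of 256 tokens are illustrative.

```python
def truncate_to_context(tokens, max_tokens, reserve_for_output=0):
    """Keep the most recent tokens that fit in the context window.

    reserve_for_output leaves room for the tokens the model will generate.
    """
    budget = max_tokens - reserve_for_output
    if budget <= 0:
        raise ValueError("no room left for input tokens")
    return tokens[-budget:]

# A 3,000-word input, tokenized naively by whitespace (stand-in tokenizer).
text = "word " * 3000
tokens = text.split()
kept = truncate_to_context(tokens, max_tokens=2048, reserve_for_output=256)
```

Here `kept` holds the last 1,792 tokens, leaving 256 of the 2,048-token window for the response. Keeping the end of the input is a design choice that suits chat-style use; for other tasks, summarizing or chunking the input may work better.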
Due to the stochastic nature of these models, achieving 100% accuracy is currently unattainable. Therefore, it remains crucial to keep humans in the loop to verify any content generated by an LLM before it reaches the end user. This becomes especially important in enterprise settings where concerns regarding potential liability may arise.
You may have noticed how recent many of these LLMs are. This field is evolving rapidly, and the pace is accelerating even further, as indicated by the increasing number of parameters. However, it's important to remember that the true value of a model lies in its practical application.
Over the past few years, the size of large language models has been growing exponentially, increasing by a factor of 10 each year. This trend is reminiscent of Moore's Law, but we should be aware that it can lead to diminishing returns, higher costs, increased complexity, and new risks. We have encountered similar situations in the past, and it is essential to learn from those experiences as we navigate this path.