The article offers a thorough overview of vector embeddings and their associated databases. Additionally, we will develop a Closed-QA Bot using the Mistral-7B model and ChromaDB.
Introduction
In the whirlwind of advancements that characterize the current era of Generative AI (GenAI) and Large Language Models (LLMs), popular applications such as ChatGPT, Anthropic’s Claude, Google Gemini have captured the public's imagination. However, the true breadth of possibilities these technologies herald extends far beyond the functionalities of chatbots and AI assistants that have made headlines. At the heart of unlocking this vast potential lies a critical, albeit less understood, concept: embeddings. This foundational element serves as the lingua franca of GenAI and LLMs, a crucial mechanism that enables these models to interpret and generate human-like text, and increasingly, to understand images and videos.
For those venturing into the realm of LLMs, terms like "vectors," "tokens," and "embeddings" frequently surface, often without clear context or understanding. Grasping these concepts is not just academic; it's essential for anyone looking to leverage LLMs to address complex business challenges or to create more nuanced and intelligent AI-driven solutions. As we move towards a multimodal future, where AI's ability to process various forms of data becomes paramount, understanding these terms becomes indispensable.
Usually, Large Language Models (LLMs) undergo training across a diverse range of datasets. Occasionally, this training approach can result in the production of responses that are inaccurate or exhibit bias, consequences of assimilating content from the extensive and unmoderated internet. To mitigate these issues, the innovation of Vector Databases is presented. These specialized databases archive data in a distinct format termed 'vector embeddings,' enhancing the ability of LLMs to interpret and apply information with greater contextual relevance and precision.
This blog post aims to demystify these core concepts through an accessible exploration, complemented by straightforward examples and practical code snippets. Our journey will navigate the synergistic relationship between Large Language Models and Vector Databases—a pairing that significantly amplifies the capabilities of LLMs beyond their standalone use.
Vector embeddings
Vector embeddings are numerical representations that capture the meaning and relationships of words, sentences, and other data. They convert data into points in a multidimensional space, where similar data is clustered together, enabling machines to understand and process the information more effectively.
For example, in the case of text data, “cat” and “kitty” have similar meaning, even though the words “cat” and “kitty” are very different if compared letter by letter. For semantic search to work effectively, representations of “cat” and “kitty” must sufficiently capture their semantic similarity. This is where vector representations are used, and why their derivation is so important.
In the following diagram, we illustrate how text is transformed into word vectors, a process in NLP that allows for the quantification and analysis of linguistic connections. For instance, the vector for "puppy" would be located closer in the vector space to "dog" than to "house," indicating their semantic closeness. This method also applies to analogical relationships. For example, the vector difference and orientation between "man" and "woman" might be similar to that between "king" and "queen." This demonstrates that word vectors not only encapsulate word meanings but also facilitate meaningful comparisons of their semantic relations within a multidimensional vector space.
Vector representation of words. Credits
Beyond word and sentence embeddings, vector embeddings can represent documents, images, user profiles, products, and more. These embeddings help machine learning algorithms find patterns and perform tasks like sentiment analysis, language translation, and recommendation systems.
Embeddings from Documents, Audio and Images.Credits
Types of Vector Embeddings
Word Embeddings: These are vectors that embody individual words, employing techniques such as Word2Vec, GloVe, and FastText. Such methods ascertain word embeddings by gleaning semantic connections and contextual insights from vast text datasets.
Sentence Embeddings: This type involves representing sentences in their entirety as vectors. Approaches like the Universal Sentence Encoder (USE) and SkipThought produce embeddings that encapsulate the core essence and context of the sentences.
Document Embeddings: Documents, ranging from news articles and scholarly papers to entire books, are represented as vectors. This encapsulates the document's semantic nuances and overall context. Methods such as Doc2Vec and Paragraph Vectors are crafted to derive these document embeddings.
Image Embeddings: Images are converted into vector form, encapsulating various visual attributes. Techniques including convolutional neural networks (CNNs) and pre-trained models like ResNet and VGG are employed to create image embeddings for tasks such as image categorization, object identification, and assessing image similarity.
User Embeddings: These vectors represent users within a system or platform, capturing their preferences, activities, and traits. User embeddings find applications in diverse areas, from recommendation engines and tailored marketing to segmenting users.
Product Embeddings: In e-commerce or recommendation environments, products are represented as vectors. This encapsulation includes the product's characteristics, attributes, and any semantic data available. Such embeddings enable algorithms to compare, recommend, and evaluate products based on their vectorized representations.
Are Embeddings and Vectors the same thing?
In the context of vector embeddings, "embeddings" and "vectors" essentially refer to the same concept. Both terms denote the numerical representations of data, with each data point depicted as a vector within a high-dimensional space.
The term "vector" specifically denotes an array of numbers that has defined dimensions. For vector embeddings, these vectors symbolize various data points positioned in a continuous vector space. On the other hand, "embeddings" specifically relate to the method of portraying data as vectors, engineered to encapsulate meaningful information, semantic relations, or contextual attributes. Embeddings aim to reflect the intrinsic structure or qualities of the data and are usually developed through training algorithms or models.
Although the terms can be used interchangeably within the scope of vector embeddings, "embeddings" accentuates the concept of data representation in a significant and organized manner, whereas "vectors" primarily refer to the format of the numerical representation itself.
What is a Vector Database?
A vector database is a specialized type of database designed to store, manage, and operate on vector embeddings efficiently. These embeddings are high-dimensional vectors that represent various types of data such as text, images, audio, and more in a vector space. The purpose of a vector database is to facilitate operations that rely heavily on the similarity of these embeddings, which is a key aspect in many machine learning and artificial intelligence applications.
Here are some key features and functions of vector databases:
Efficient Similarity Search: Vector databases are optimized to perform fast and accurate similarity searches. They can quickly identify vectors in the database that are closest to a given query vector, typically using distance metrics like Euclidean distance or cosine similarity. This capability is crucial for applications like recommendation systems, image retrieval, and semantic text search.
Scalability: They are designed to handle large volumes of high-dimensional data and scale efficiently as the size of the data grows. This is essential for applications dealing with big data sets.
Indexing and Querying: Advanced indexing techniques are used to manage the high-dimensional space and speed up query processing. This includes tree-based indexing, hashing, or even approximate nearest neighbor (ANN) algorithms, which provide a good balance between accuracy and speed.
Integration with Machine Learning Pipelines: Vector databases easily integrate with machine learning workflows, especially those involving neural networks and deep learning. They can store and retrieve embeddings generated by machine learning models, supporting dynamic updating and querying as new data is processed.
Real-time Processing: Many vector databases support real-time data insertion and querying, making them suitable for dynamic and interactive applications where quick response times are crucial.
Overall, vector databases play a crucial role in modern data ecosystems, particularly in scenarios where understanding the semantic relationships between data points is necessary. They provide a robust infrastructure for managing and leveraging vector data to power various AI-driven applications.
How does a vector database work?
Vector databases function by utilizing algorithms to both index and query vector embeddings. These algorithms facilitate approximate nearest neighbor (ANN) searches through methods like hashing, quantization, or graph-based techniques.
A typical Vector Database pipeline. Credits
For retrieving data, ANN searches pinpoint the nearest vector neighbor to a query. This method, being less computationally demanding than a kNN (known nearest neighbor or true k nearest neighbor algorithm) search, offers a trade-off with its reduced accuracy. Nonetheless, it achieves efficient and scalable results for extensive datasets of high-dimensional vectors.
Indexing: Vector databases index vectors using hashing, quantization, or graph-based approaches, enabling quicker searches:
Hashing: Algorithms like locality-sensitive hashing (LSH) excel in ANN searches due to their speed and ability to produce approximate results. LSH maps nearest neighbors using hash tables, similar to organizing a Sudoku puzzle. A query hashed into a table is then compared against vectors in the same table for similarity.
Quantization: Techniques like product quantization (PQ) break vectors into smaller segments, code these segments, and then reassemble them. This process produces a code representation of a vector, with a collection of these codes forming a codebook. When queried, the database matches the query against this codebook to identify the closest code, thus finding the most similar vector.
Graph-based: Using graph algorithms such as Hierarchical Navigable Small World (HNSW), vectors are represented as nodes within a graph. Nodes are clustered and connected by edges to similar nodes, creating hierarchical graphs. Queries navigate these graphs to locate nodes with vectors closest to the query vector.
Additionally, vector databases index data object metadata, leading to the presence of both a vector index and a metadata index.
Querying: Upon receiving a query, a vector database compares indexed vectors with the query vector using mathematical similarity measures to identify the nearest vector neighbors:
Cosine similarity calculates similarity on a scale from -1 to 1, determining how vectors relate based on the cosine of the angle between them. Values range from -1 (opposite directions), 0 (orthogonal), to 1 (identical).
Euclidean distance measures the straight-line distance between vectors, with 0 indicating identical vectors and higher values showing greater differences.
Dot product similarity spans from minus infinity to infinity, assessing the product of two vectors' magnitudes and the cosine of their angle, indicating the directionality and alignment of vectors.
Post-processing: After the initial search, some vector databases re-rank or re-filter the results using different similarity measures or by filtering based on metadata to refine the nearest neighbors identified in the search.
Some databases also include a preprocessing or pre-filtering stage before running a vector search, applying filters based on specific criteria to refine the search process further.
Core components of vector databases
A vector database typically consists of the following fundamental components:
Performance and Fault Tolerance: To ensure high performance and resilience against failures, vector databases implement sharding and replication. Sharding distributes data across multiple nodes, while replication creates multiple copies of data on different nodes. This setup maintains performance continuity and fault tolerance in the event of a node failure.
Monitoring Capabilities: Effective operation of a vector database requires continuous monitoring of resource usage, query performance, and overall system health. This is essential to maintain performance standards and system reliability.
Access Control Capabilities: Security management is crucial in vector databases. Access control measures help ensure compliance, accountability, and enable auditing of database activities. These controls ensure that data is accessible only by authorized users and that all user activities are recorded.
Scalability and Tunability: As data volumes grow, the ability to scale horizontally is essential. Access controls also influence the scalability and tunability of a vector database, accommodating varying rates of data insertion, query demands, and differing hardware environments.
Multiple Users and Data Isolation: Accommodating multiple users or supporting multi-tenancy is important for vector databases. They must also ensure data isolation, keeping user activities like insertions, deletions, or queries confidential among users unless sharing is explicitly authorized.
Backups: Regular data backups are a crucial aspect of vector database management, providing a safety net in case of system failures, data loss, or corruption. Backups help restore the database to its previous state, minimizing downtime and data loss.
APIs and SDKs: Vector databases utilize APIs (Application Programming Interfaces) to facilitate a user-friendly interaction layer, allowing applications to communicate through requests and responses. SDKs (Software Development Kits) wrap around these APIs, offering tools in various programming languages that simplify the development of applications (like semantic search or recommendation systems) without needing deep knowledge of the underlying database structure.
After integrating all the components, this is how vector embedding and vector databases function together.
Available Vector Databases
Below image illustrates classification of vector databases into two primary categories: Dedicated vector databases and databases that support vector search. Each category is further divided based on the availability and licensing of the software:
Dedicated Vector Databases:
These are specialized databases designed primarily for vector storage and search capabilities.
The image shows the logos of several dedicated vector databases:
chroma
vespa
marqo
LanceDB
Milvus
These databases are marked as open source with an Apache 2.0 or MIT license.
Databases that Support Vector Search:
These are general databases that have extended functionality to support vector search.
The logos included in this category are:
OpenSearch
PostgreSQL
ClickHouse
cassandra
elasticsearch
redis
ROCKSET
SingleStore
The landscape of vector databases. Credits
Building a Closed-QA Bot with Mistral-7B and ChromaDB
In this section, we describe the methodology for constructing a Closed Q&A bot using a Vector database. This bot is specifically tailored to handle science-related inquiries with a combination of advanced technological components:
Databricks-dolly-15k HuggingFace Dataset: This is an open-source dataset comprised of instruction-following entries produced by Databricks employees. It is utilized for training large language models (LLMs), synthetic data production, and data augmentation. The dataset encompasses a variety of prompts and responses across categories such as brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
Chroma as the Vector Store (Knowledge Base): We use Chroma as the primary vector storage, serving as the knowledge base for our bot.
Sentence Transformers for Semantic Search: We specifically employ the 'multi-qa-MiniLM-L6-cos-v1' model from Sentence Transformers, which is optimized for semantic search applications. This model generates embeddings that are stored in Chroma.
Mistral 7B Instruct Model: Mistral 7B is a 7-billion-parameter language model released by Mistral AI. Mistral 7B is a carefully designed language model that provides both efficiency and high performance to enable real-world applications. Due to its efficiency improvements, the model is suitable for real-time applications where quick responses are essential.
Setting up the Environment
For implementing the code discussed in this article, the following installations are necessary:
!pip install -qU \
torch==2.0.1 \
einops==0.6.1 \
accelerate==0.20.3 \
datasets==2.14.5 \
chromadb \
sentence-transformers==2.2.2
!pip install git+https://github.com/huggingface/transformers
Constructing the “Knowledge Base”
Initially, we obtain the Databricks-Dolly dataset and concentrate on the closed_qa segment. The entries in this category, which typically require accurate and specific information, present a unique challenge to a Large Language Model (LLM) that has been trained on a broad range of data due to their detailed nature.
from datasets import load_dataset
# Load only the training split of the dataset
train_dataset = load_dataset("databricks/databricks-dolly-15k", split='train')
# Filter the dataset to only include entries with the 'closed_qa' category
closed_qa_dataset = train_dataset.filter(lambda example: example['category'] == 'closed_qa')
print(closed_qa_dataset[0])
A typical dataset entry appears as follows:
{
'instruction': 'When did Virgin Australia start operating?',
'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
'category': 'closed_qa'
}
Now lets create word embeddings for each pair of instructions and their associated contexts, incorporating these into our vector database, ChromaDB.
ChromaDB stands out as an open-source vector storage solution adept at handling vector embeddings. It is specifically designed to support applications such as semantic search engines, which are essential in the fields of natural language processing and machine learning. The in-memory database capabilities of ChromaDB ensure quick access and efficient handling of data, which is vital for processing data at high speeds. Its compatibility with Python makes it even more suitable for our project, as it allows for seamless integration into our existing workflow.
For comprehensive information, refer to the ChromaDB Documentation.
For the creation of embeddings for the answers, we employ the multi-qa-MiniLM-L6-cos-v1 model, tailored for semantic search scenarios. This model excels at identifying pertinent text passages in response to a question or search request, aligning perfectly with our objectives.
Below, we demonstrate the method by which embeddings are maintained within the in-memory collections of Chroma.
import chromadb
from sentence_transformers import SentenceTransformer
class VectorStore:
def __init__(self, collection_name):
# Initialize the embedding model
self.embedding_model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
self.chroma_client = chromadb.Client()
self.collection = self.chroma_client.create_collection(name=collection_name)
# Method to populate the vector store with embeddings from a dataset
def populate_vectors(self, dataset):
for i, item in enumerate(dataset):
combined_text = f"{item['instruction']}. {item['context']}"
embeddings = self.embedding_model.encode(combined_text).tolist()
self.collection.add(embeddings=[embeddings], documents=[item['context']], ids=[f"id_{i}"])
# Method to search the ChromaDB collection for relevant context based on a query
def search_context(self, query, n_results=1):
query_embeddings = self.embedding_model.encode(query).tolist()
return self.collection.query(query_embeddings=query_embeddings, n_results=n_results)
# Example usage
if __name__ == "__main__":
# Initialize the handler with collection name
vector_store = VectorStore("knowledge-base")
# Assuming closed_qa_dataset is defined and available
vector_store.populate_vectors(closed_qa_dataset)
For every record in the dataset, we create and preserve an embedding that fuses the 'instruction' and 'context' data, where the context serves as the retrieval document for our LLM prompts.
Following this, we plan to deploy the Mistral-7b-instruct LLM to produce answers to closed informational queries independently of supplementary context, thus demonstrating the enhanced capability of our knowledge base.
To generate basic answers, we will use the Mistral-7b-instruct model from Hugging Face. This efficient, advanced model from the Falcon series delivers strong language understanding and generation capabilities in a manageable size, making it well-suited for our needs. When running the model, keep in mind the hardware requirements - it needs at least 16GB of RAM, and using a GPU is highly recommended for optimal performance and faster response times.
import transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
class Mistral7BModel:
def __init__(self):
# Model name
model_name = "/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"
self.pipeline, self.tokenizer = self.initialize_model(model_name)
def initialize_model(self, model_name):
# Tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Pipeline setup for text generation
pipeline = transformers.pipeline(
"text-generation",
model=model_name,
tokenizer=tokenizer,
device=0, # This sets the pipeline to run on the first GPU
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
trust_remote_code=True,
)
return pipeline, tokenizer
def generate_answer(self, question, context=None):
# Preparing the input prompt
prompt = question if context is None else f"{context}\n\n{question}"
# Generating responses
sequences = self.pipeline(
prompt,
max_length=500,
do_sample=True,
top_k=10,
num_return_sequences=1,
eos_token_id=self.tokenizer.eos_token_id,
)
# Extracting and returning the generated text
return sequences
Now lets infer the model.
# Initialize the Falcon model class
modelObj = Mistral7BModel()
user_question = "When was Tomoaki Komorida born?"
# Generate an answer to the user question using the LLM
answer = modelObj.generate_answer(user_question)
print(f"Result: {answer}")
{'generated_text': 'When was Tomoaki Komorida born?
A:1963-08-01'}
user_question = "When was Microsoft founded?"
# Generate an answer to the user question using the LLM
answer = modelObj.generate_answer(user_question)
generated_text = answer[0]['generated_text']
print(f"Result: {generated_text}")
user_question = "How to build a image classification model using CNN"
# Generate an answer to the user question using the LLM
answer = modelObj.generate_answer(user_question)
generated_text = answer[0]['generated_text']
print(f"Result: {generated_text}")
Generating Context-Aware Answers
Now, let's elevate our generative model's capability by providing it with relevant context, retrieved from our vector store.
Interestingly, we're using the same VectorStore class we for both generating embeddings and fetching context from the user question:
user_question = "Who is Elon Musk"
# Generate an answer to the user question using the LLM
answer = modelObj.generate_answer(user_question)
generated_text = answer[0]['generated_text']
print(f"Result: {generated_text}")
# Assuming vector_store and Mistral have already been initialized
# Fetch context from VectorStore, assuming it's been populated
context_response = vector_store.search_context(user_question)
# Extract the context text from the response
# The context is assumed to be in the first element of the 'context' key
context = "".join(context_response['documents'][0])
# Generate an answer using the Falcon model, incorporating the fetched context
enriched_answer = modelObj.generate_answer(user_question, context=context)
generated_text = enriched_answer[0]['generated_text']
print(f"Result: {generated_text}")
Conclusion
Throughout our comprehensive guide, we've taken you through the steps of developing a sophisticated Large Language Model (LLM) application, powered by bespoke datasets. We've unveiled the intricacies of managing such a model, from the experimental stages with various datasets to establishing the required infrastructure, and finally bringing to life a functioning solution. Moreover, we've demonstrated how to construct a Closed Q&A bot, leveraging the capabilities of a vector database—ChromaDB—coupled with the Mistral-7B model. This bot is adept at addressing scientific questions, illustrating the powerful synergy between cutting-edge technology and specialized data.
In essence, vector embeddings and vector databases are pivotal in amplifying the utility and precision of LLMs, enabling them to process and interact with information at an unprecedented level.
Comments