Nagesh Singh Chauhan

DSPy: A Revolutionary Framework for Programming LLMs

This article offers a comprehensive exploration of the DSPy framework, which is aimed at optimizing LLM prompts for enhanced performance.

"DSPy is a framework for programming language models"

What is DSPy?

DSPy, short for "Declarative Self-improving Language Programs", stands at the forefront of combining large language models (LLMs) and retrieval models (RMs) to tackle complex tasks.

Developed by the innovative minds at Stanford NLP, DSPy heralds a new era of "programming with foundation models." This framework transcends traditional prompting techniques by emphasizing a programming-centric approach, thus shifting LM-based pipeline development towards a more structured and efficient programming paradigm.

DSPy's mission is to address the inherent fragility of LLM-based applications by advocating for a programming-first methodology. This shift allows for the dynamic recompilation of the entire pipeline, tailored specifically to the nuances of the task at hand, thereby eliminating the need for continuous manual prompt adjustments.

Functionality at its Core:

  • Simulation and Instruction: DSPy's automated compiler traces your program's execution on given inputs and uses those traces to teach LMs how to carry out the program's declarative steps.

  • Pythonic Elegance: Offering modules that are both composable and declarative, DSPy introduces a familiar Pythonic syntax for instructing LMs, paving the way for intuitive and streamlined programming practices.

By fostering a programming-over-prompting mindset, DSPy not only simplifies the integration of foundational models into applications but also significantly enhances adaptability and efficiency in LM-based solutions.

DSPy can routinely teach powerful models like GPT-3.5 or GPT-4 and local models like T5-base or Llama2-13b to be much more reliable at tasks, i.e. having higher quality and/or avoiding specific failure patterns.

The workflow of building an LM-based application with DSPy is shown below. It will remind you of the workflow for training a neural network:

Workflow of building an LLM-based app with DSPy.

  1. Gather Dataset: Assemble a selection of input-output examples (such as question-answer pairs) for refining your pipeline.

  2. Develop DSPy Program: Craft the logic of your program using signatures and modules, detailing the flow of information to address your specific task.

  3. Establish Validation Logic: Create criteria for enhancing your program based on a validation metric and an optimizer (teleprompter).

  4. Compile with DSPy: Utilize the DSPy compiler, incorporating your training dataset, program, optimizer, and validation metric to enhance your program (for instance, through prompt optimization or fine-tuning).

  5. Refine Continuously: Engage in an iterative cycle of refining your dataset, program, or validation logic to achieve the desired performance level of your pipeline.

DSPy and Other Frameworks

Innovations like LangChain and LlamaIndex have found their place in the toolkit of many who work with large language models (LLMs), addressing unique challenges and enhancing model capabilities. However, navigating tasks involving complex logic or customizing models for specific needs can still pose challenges.

Enter DSPy, a fresh take aimed at simplifying how we interact with LLMs. By embracing a programming-first mentality, DSPy seeks to streamline tasks that traditionally required intricate prompt crafting or intensive model tweaking. Think of DSPy and LangChain as handy assistants for crafting applications powered by language models.

LangChain specializes in breaking complex problems into manageable steps and producing structured outputs that can feed into a DSPy program, for example to sharpen a tricky task such as spotting false statements.

LlamaIndex steps into the spotlight by supercharging the ability to sift through and retrieve precise information from vast datasets, making it an indispensable ally for those looking to harness the full potential of language models for detailed and accurate data exploration.

On the flip side, DSPy shines by dialing down the need for manual prompt crafting. Its optimizers tune a program's prompts (and, where applicable, the model's weights) toward a specific metric, making pipelines more versatile and adaptable to new challenges or datasets.

DSPy might remind some of PyTorch, a stalwart in the deep learning arena, where data scientists lay out neural networks and employ layers and optimizers to embed logic seamlessly. DSPy's toolkit, including elements like ChainOfThought or Retrieve, works similarly, tweaking and tailoring prompts to hit the mark on chosen metrics.

DSPy Programming Model

At its core, DSPy introduces a structured, programming-centric approach that aims to elevate the efficiency and effectiveness of language model applications. Here's a closer look at the three fundamental components of the DSPy programming model and how they revolutionize the development process:

  • Signatures: Abstracting prompting and fine-tuning

  • Modules: Abstracting prompting techniques

  • Teleprompters: Automating prompting for arbitrary pipelines


Signatures

In DSPy, when we delegate tasks to Large Language Models (LLMs), we outline the expected behavior through what we call a "Signature".

A Signature in DSPy is essentially a contract that specifies the expected input/output dynamics of a module. It focuses on what the LM needs to achieve, steering away from dictating the specific prompts to use for the task.

Similar to how function signatures work by detailing the input and output parameters along with their types, DSPy Signatures operate on a parallel concept but with a couple of notable distinctions:

  • Unlike traditional function signatures that primarily describe parameters, DSPy Signatures actively shape and govern the module's behavior.

  • The terminology used in DSPy Signatures is crucial; it communicates the semantic roles clearly, distinguishing between elements like a 'question' and an 'answer' or 'sql_query' and 'python_code'.

The Value of Utilizing DSPy Signatures

In Brief: Utilizing DSPy Signatures fosters modular, streamlined code that enables optimization of LM interactions into effective prompts or facilitates automatic fine-tuning.

Detailed Perspective: Conventionally, tasks are imposed on LMs through elaborate and fragile prompt engineering or by generating specific datasets for fine-tuning. DSPy's approach to writing signatures offers a more structured, flexible, and repeatable method. The DSPy compiler is tasked with crafting an optimized prompt or fine-tuning the LM specifically for the outlined Signature, based on the data at hand and within the set pipeline. This process often surpasses human capability in prompt creation not through creativity but via extensive experimentation and direct metric optimization.

Implementing DSPy Signatures

Signatures can succinctly be articulated as strings, with named arguments that signify the roles of inputs and outputs.

DSPy signatures replace hand-written prompts.

A signature is a tuple of input and output fields in its minimal form.

Structure of a minimal DSPy signature

For instance:

  • Question Answering could be represented as "question -> answer".

  • Sentiment Classification as "sentence -> sentiment".

  • Summarization as "document -> summary".

Furthermore, Signatures can accommodate multiple inputs/outputs, such as:

  • Retrieval-Augmented Question Answering: "context, question -> answer".

  • Multiple-Choice Question Answering with Reasoning: "question, choices -> reasoning, selection".
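To make the string form concrete, here is a minimal plain-Python sketch of how such a signature string could be split into named input and output fields. The parse_signature helper is hypothetical and written only for illustration; it is not DSPy's own parser.

```python
from typing import List, Tuple


def parse_signature(signature: str) -> Tuple[List[str], List[str]]:
    """Split 'a, b -> c, d' into (['a', 'b'], ['c', 'd'])."""
    if signature.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {signature!r}")
    inputs_part, outputs_part = signature.split("->")
    # Field names are comma-separated; surrounding whitespace is ignored.
    inputs = [name.strip() for name in inputs_part.split(",") if name.strip()]
    outputs = [name.strip() for name in outputs_part.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return inputs, outputs


inputs, outputs = parse_signature("question, choices -> reasoning, selection")
print(inputs)   # ['question', 'choices']
print(outputs)  # ['reasoning', 'selection']
```

The field names carry all of the semantics here, which is why choosing meaningful names matters more than any surrounding prompt text.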

Tip: While naming fields, ensure they are semantically meaningful. However, simplicity is key initially; avoid over-optimizing terms prematurely. For summarization, descriptors like "document -> summary", "text -> gist", or "long_context -> tldr" are sufficiently clear. The DSPy compiler is adept at refining these terms to optimize interaction with the LM.

Example: Summarization

import dspy

# Assumes an LM was configured beforehand, e.g. via dspy.settings.configure(lm=...).
# Example from the XSum dataset.
document = """The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."""

summarize = dspy.ChainOfThought('document -> summary')
response = summarize(document=document)

print(response.summary)

The 21-year-old Lee made seven appearances and scored one goal for West Ham last season. He had loan spells in League One with Blackpool and Colchester United, scoring twice for the latter. He has now signed a contract with Barnsley, but the length of the contract has not been revealed.

Many DSPy modules (except dspy.Predict) return auxiliary information by expanding your signature under the hood.

For example, dspy.ChainOfThought also adds a rationale field that includes the LLM's reasoning before it generates the output summary.

print("Rationale:", response.rationale)


Rationale: produce the summary. We need to highlight the key points about Lee's performance for West Ham, his loan spells in League One, and his new contract with Barnsley. We also need to mention that his contract length has not been disclosed.

Signatures can also be written as classes, where the docstring describes the task. Some optimizers in DSPy, like SignatureOptimizer, can take this simple docstring and then generate more effective variants if needed.

class Emotion(dspy.Signature):
    """Classify emotion among sadness, joy, love, anger, fear, surprise."""
    sentence = dspy.InputField()
    sentiment = dspy.OutputField()

sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"  # from dair-ai/emotion

classify = dspy.Predict(Emotion)
response = classify(sentence=sentence)
print(response.sentiment)




Modules

A DSPy module acts as a foundational component for constructing programs that leverage Large Language Models (LLMs).

You might already know about various techniques to prompt models, including starting prompts with phrases like "Your task is to ..." or "You are a ...", employing a Chain of Thought approach with cues like "Let's think step by step", or concluding prompts with directives such as "Don't make anything up" or "Only use the provided context".

DSPy modules are designed with templates and parameters to encapsulate these prompting strategies. Essentially, they serve to tailor DSPy signatures for specific tasks through the application of prompting, fine-tuning, augmentation, and reasoning methods.
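One rough way to picture this: a module owns a prompt template and fills it in from the signature's fields. The sketch below is plain Python with a hypothetical render_prompt helper. DSPy's real templates are different, and the exact "reasoning" wording is an assumption made for illustration.

```python
from typing import Dict, List


def render_prompt(
    instruction: str,
    inputs: Dict[str, str],
    output_fields: List[str],
    step_by_step: bool = False,
) -> str:
    """Render an instruction, the filled input fields, and an open slot for the LM."""
    lines = [instruction, ""]
    lines += [f"{name}: {value}" for name, value in inputs.items()]
    if step_by_step:
        # Chain-of-thought style: ask for reasoning before the output fields.
        lines.append("reasoning: Let's think step by step.")
    else:
        # Predict style: ask directly for the first output field.
        lines.append(f"{output_fields[0]}:")
    return "\n".join(lines)


instruction = "Classify the sentiment of the sentence."
fields = {"sentence": "it's a charming and often affecting journey."}

plain = render_prompt(instruction, fields, ["sentiment"])
cot = render_prompt(instruction, fields, ["sentiment"], step_by_step=True)
print(plain)
print("---")
print(cot)
```

The signature stays the same in both calls; only the strategy used to turn it into a prompt changes, which is the separation that modules provide.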

How can you utilize a pre-defined module, such as dspy.Predict or dspy.ChainOfThought?

Begin with dspy.Predict, the core module upon which all other DSPy modules are constructed.

It's presumed you have some understanding of DSPy signatures, which are essentially declarations that outline how any given module in DSPy should behave.

To implement a module, you initially define it with a specific signature. Following this, you execute the module using the provided input arguments and then retrieve the results from the designated output fields.

sentence = "it's a charming and often affecting journey." 

# 1) Declare with a signature.
classify = dspy.Predict('sentence -> sentiment')

# 2) Call with input argument(s). 
response = classify(sentence=sentence)

# 3) Access the output.
print(response.sentiment)



When we declare a module, we can pass configuration keys to it.

Below, we'll pass n=5 to request five completions. We can also pass temperature or max_len, etc.

Let's use dspy.ChainOfThought. In many cases, simply swapping dspy.ChainOfThought in place of dspy.Predict improves quality.

question = "What's something great about the ColBERT retrieval model?"

# 1) Declare with a signature, and pass some config.
classify = dspy.ChainOfThought('question -> answer', n=5)

# 2) Call with input argument.
response = classify(question=question)

# 3) Access the outputs.
print(response.completions.answer)


['One great thing about the ColBERT retrieval model is its superior efficiency and effectiveness compared to other models.',
 'Its ability to efficiently retrieve relevant information from large document collections.',
 'One great thing about the ColBERT retrieval model is its superior performance compared to other models and its efficient use of pre-trained language models.',
 'One great thing about the ColBERT retrieval model is its superior efficiency and accuracy compared to other models.',
 'One great thing about the ColBERT retrieval model is its ability to incorporate user feedback and support complex queries.']

Initial implementation of the signature "context, question -> answer" with a ChainOfThought module

What additional DSPy modules exist, and how are they utilized?

The other modules work in much the same way; they differ mainly in the internal behavior that implements your signature.

  • dspy.Predict: This is the foundational prediction module that directly utilizes the signature without any alterations. It manages essential learning processes such as saving instructions, demonstrations, and updates to the LM.

  • dspy.ChainOfThought: This module guides the LM to process information in a sequential, step-by-step manner before generating a response aligned with the signature.

  • dspy.ProgramOfThought: It instructs the LM to produce code, the execution of which determines the response in accordance with the signature.

  • dspy.ReAct: Functions as an agent capable of employing tools to fulfill the requirements of the specified signature.

  • dspy.MultiChainComparison: This module is capable of evaluating multiple outcomes from the ChainOfThought process to arrive at a conclusive prediction.

There are also modules resembling functions, such as:

  • dspy.majority: Executes a simple voting mechanism to identify and return the most common answer from a group of predictions.
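The voting idea can be sketched with the standard library alone. The majority_vote function below is a hypothetical stand-in written for illustration, not the code behind dspy.majority, and its lowercase-and-strip normalization is an assumption.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer after light normalization.

    Ties are resolved in favor of the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip().lower() for answer in answers]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the original spelling of the first answer that matches the winner.
    return answers[normalized.index(winner)]


print(majority_vote(["Paris", "paris ", "Lyon"]))  # Paris
```

Combined with a module configured to return several completions, a vote like this turns a set of candidate answers into a single prediction.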

Optimizers (formerly Teleprompters)

A DSPy optimizer is an algorithm that tunes the parameters of a DSPy program (the prompts and/or the language model (LM) weights) to maximize the metrics you specify, such as accuracy.

DSPy features a variety of built-in optimizers, each employing distinct approaches. Essentially, a DSPy optimizer requires three key components:

  1. Your DSPy Program: This could range from a straightforward single module, like dspy.Predict, to a more elaborate setup involving several modules working together.

  2. Your Chosen Metric: A function that assesses your program's output by assigning it a score, with higher scores indicating better performance.

  3. A Set of Training Inputs: Even a handful (about 5 to 10 examples) can be sufficient, and they may be incomplete (only inputs for your program, without corresponding outputs).

DSPy's ability to work with limited data means you can begin with a small dataset and still achieve impressive outcomes. However, if you have access to a larger dataset, DSPy is capable of utilizing it to potentially yield even better results.
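As an illustration of the second component, a metric is just a function that scores one prediction against one example. The sketch below follows the (example, pred, trace=None) argument convention used in DSPy's documentation, but SimpleNamespace objects stand in for DSPy examples and predictions so that the snippet runs without the library. The data is invented for the example.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None) -> bool:
    """Score a prediction: True when the answers match, ignoring case and spacing."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


# A tiny, made-up training set of question-answer pairs.
trainset = [
    SimpleNamespace(question="What is 2 + 2?", answer="4"),
    SimpleNamespace(question="What is the capital of France?", answer="Paris"),
]

good = SimpleNamespace(answer=" paris ")
bad = SimpleNamespace(answer="Lyon")
print(exact_match(trainset[1], good))  # True
print(exact_match(trainset[1], bad))   # False
```

Because higher is better, a boolean works as a score: an optimizer can average it over the training inputs to compare candidate prompts.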

All the available optimizers can be accessed via:

from dspy.teleprompt import *

Available Optimizers in DSPy

DSPy Compiler

The DSPy compiler enhances your program's effectiveness by optimizing it for specific metrics like quality improvement or cost reduction, adapting its strategy based on the LM type:

  • For LLMs: It generates high-quality, task-specific few-shot prompts.

  • For Smaller LMs: It focuses on precise automatic fine-tuning.

This compiler intelligently combines prompting, fine-tuning, reasoning, and augmentation to refine program modules. It simulates various iterations, using these insights for continuous module enhancement, akin to neural network training.

For example, an initial ChainOfThought prompt gives any LM a basic introduction to the task but may not be optimal. The DSPy compiler refines such prompts automatically, removing the need for manual adjustments and improving program performance with less effort.

How the DSPy compiler optimizes the initial prompt.

Multi-Hop Question Answering using DSPy

Problem Statement:

In complex Question Answering (QA) tasks, a single search query is often not enough. The limitation is clearest when the answer requires chaining facts, such as finding the birthplace of a person identified only through a work they wrote. For instance, when asked for the birth city of the writer of "Right Back At It Again," a single search query might correctly identify "Jeremy McKinnon" as the writer yet fail to retrieve his place of birth.

To address these complexities, the field of retrieval-augmented Natural Language Processing (NLP) has seen the development of multi-hop search systems, exemplified by innovations like GoldEn (Qi et al., 2019) and Baleen (Khattab et al., 2021).

You can access the entire notebook here.

Start with installing the DSPy library.

pip install dspy-ai

We'll be using GPT-3.5 for this use case.

Next, load the dataset. We will be using the HotPotQA dataset, a collection of complex question-answer pairs typically answered in a multi-hop fashion.

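Now, build the signatures. We'll start by creating the GenerateAnswer signature, which takes context and question as input and gives answer as output. The pipeline also relies on a GenerateSearchQuery signature, which takes context and question as input and produces a search query for each hop.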

Now that we have the necessary signatures in place, we can start building the pipeline!

The __init__ method of the pipeline module (SimplifiedBaleen) defines a few key sub-modules:

  • generate_query: For each hop, we will have one dspy.ChainOfThought predictor with the GenerateSearchQuery signature.

  • retrieve: This module will conduct the search using the generated queries over our defined ColBERT RM search index via the dspy.Retrieve module.

  • generate_answer: This dspy.Predict module will be used with the GenerateAnswer signature to produce the final answer.

The forward method uses these sub-modules in simple control flow.

  • First, we'll loop up to self.max_hops times.

  • In each iteration, we'll generate a search query using the predictor at self.generate_query[hop].

  • We'll retrieve the top-k passages using that query.

  • We'll add the (deduplicated) passages to our context accumulator.

  • After the loop, we'll use self.generate_answer to produce an answer.

  • We'll return a prediction with the retrieved context and predicted answer.
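The control flow in the steps above can be sketched in plain Python. This is an illustration only: MultiHopSketch and the toy_* functions are hypothetical stand-ins for the DSPy sub-modules (the query generator, the retriever, and the answer generator), and the two-passage corpus is invented so that the snippet runs without an LM or a search index.

```python
from typing import Callable, Dict, List


def deduplicate(passages: List[str]) -> List[str]:
    """Drop repeated passages while keeping first-seen order."""
    return list(dict.fromkeys(passages))


class MultiHopSketch:
    def __init__(self, generate_query: Callable, retrieve: Callable,
                 generate_answer: Callable, max_hops: int = 2):
        # One query generator per hop, mirroring the description above.
        self.generate_query = [generate_query for _ in range(max_hops)]
        self.retrieve = retrieve
        self.generate_answer = generate_answer
        self.max_hops = max_hops

    def forward(self, question: str) -> Dict[str, object]:
        context: List[str] = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context, question)
            passages = self.retrieve(query)
            context = deduplicate(context + passages)
        answer = self.generate_answer(context, question)
        return {"context": context, "answer": answer}


def toy_generate_query(context: List[str], question: str) -> str:
    # First hop: search with the question; later hops: follow the last passage.
    return question if not context else context[-1]


def toy_retrieve(query: str) -> List[str]:
    if "Author Y" in query:
        return ["Author Y was born in City Z.", "Book X was written by Author Y."]
    if "Book X" in query:
        return ["Book X was written by Author Y."]
    return []


def toy_generate_answer(context: List[str], question: str) -> str:
    for passage in context:
        if "was born in" in passage:
            return passage.split("was born in")[1].strip(" .")
    return "unknown"


program = MultiHopSketch(toy_generate_query, toy_retrieve, toy_generate_answer)
result = program.forward("Where was the author of Book X born?")
print(result["answer"])  # City Z
```

The first hop only finds who wrote the book; the second hop uses that passage to find the birthplace, and the repeated passage is dropped by the deduplication step.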

Let's execute this program in its zero-shot (uncompiled) setting.

This doesn't necessarily imply the performance will be bad but rather that we're bottlenecked directly by the reliability of the underlying LM to understand our sub-tasks from minimal instructions. Often, this is perfectly fine when using the most expensive/powerful models (e.g., GPT-4) on the easiest and most standard tasks (e.g., answering simple questions about popular entities).


Question: How many storeys are in the castle that David Gregory inherited?
Predicted Answer: five
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'St. Gregory Hotel | The St. Gregory Hotel is a boutique hotel located in downtown Washington, D.C., in the United States. Established in 2000, the nine-floor hotel has 155 rooms, which includes 54 del...', 'Karl D. Gregory Cooperative House | Karl D. Gregory Cooperative House is a member of the Inter-Cooperative Council at the University of Michigan. The structure that stands at 1617 Washtenaw was origin...', 'Kinnairdy Castle | Kinnairdy Castle is a tower house, having five storeys and a garret, two miles south of Aberchirder, Aberdeenshire, Scotland. The alternative name is Old Kinnairdy....', 'Kinnaird Castle, Brechin | Kinnaird Castle is a 15th-century castle in Angus, Scotland. The castle has been home to the Carnegie family, the Earl of Southesk, for more than 600 years....', 'Kinnaird Head | Kinnaird Head (Scottish Gaelic: "An Ceann Àrd" , "high headland") is a headland projecting into the North Sea, within the town of Fraserburgh, Aberdeenshire on the east coast of Scotla...']

However, a zero-shot approach quickly falls short for more specialized tasks, novel domains/settings, and more efficient (or open) models.

To address this, DSPy offers compilation. Let's compile our multi-hop (SimplifiedBaleen) program.

Let's first define our validation logic for compilation:

  • The predicted answer matches the gold answer.

  • The retrieved context contains the gold answer.

  • None of the generated queries is rambling (i.e., none exceeds 100 characters in length).

  • None of the generated queries is roughly repeated (i.e., none is within 0.8 or higher F1 score of earlier queries).
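These four checks can be written as one function. The sketch below is plain Python under stated assumptions: the token_f1 helper is a simple whitespace-token F1 written for illustration, and the generated queries are passed in directly, whereas a DSPy metric would read them from the program's trace. The question and passage are taken from the example output shown earlier.

```python
from collections import Counter
from types import SimpleNamespace
from typing import List


def token_f1(a: str, b: str) -> float:
    """F1 overlap between the whitespace tokens of two strings."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def validate(example, pred, queries: List[str]) -> bool:
    # 1) The predicted answer matches the gold answer.
    if example.answer.lower() != pred.answer.lower():
        return False
    # 2) The retrieved context contains the gold answer.
    if not any(example.answer.lower() in p.lower() for p in pred.context):
        return False
    # 3) No query is rambling (longer than 100 characters).
    if any(len(query) > 100 for query in queries):
        return False
    # 4) No query roughly repeats an earlier one (F1 of 0.8 or higher).
    for i, query in enumerate(queries):
        if any(token_f1(query, earlier) >= 0.8 for earlier in queries[:i]):
            return False
    return True


example = SimpleNamespace(
    question="How many storeys are in the castle that David Gregory inherited?",
    answer="five",
)
pred = SimpleNamespace(
    answer="five",
    context=["Kinnairdy Castle is a tower house, having five storeys and a garret"],
)
queries = ["castle David Gregory inherited", "number of storeys in the castle"]
print(validate(example, pred, queries))  # True
```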

We'll use one of the most basic teleprompters in DSPy, namely BootstrapFewShot, to optimize the predictors in the pipeline with few-shot examples.
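The idea behind bootstrapping few-shot examples can be sketched in plain Python: run the program on training inputs, keep the runs whose output passes the metric, and reuse those as demonstrations. This is a simplified illustration with a hypothetical bootstrap_demos helper and a toy stand-in for the LM-backed program; it is not the BootstrapFewShot implementation.

```python
from typing import Callable, Dict, List


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: List[Dict[str, str]],
    metric: Callable[[Dict[str, str], str], bool],
    max_demos: int = 2,
) -> List[Dict[str, str]]:
    """Collect up to max_demos program runs that pass the metric."""
    demos: List[Dict[str, str]] = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) >= max_demos:
            break
    return demos


def toy_program(question: str) -> str:
    """Stand-in for an LM-backed program; it gets one sum wrong on purpose."""
    known = {"What is 1 + 1?": "2", "What is 2 + 2?": "4", "What is 3 + 3?": "7"}
    return known.get(question, "unknown")


def exact_match(example: Dict[str, str], prediction: str) -> bool:
    return example["answer"] == prediction


trainset = [
    {"question": "What is 3 + 3?", "answer": "6"},  # the toy program fails here
    {"question": "What is 1 + 1?", "answer": "2"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

demos = bootstrap_demos(toy_program, trainset, exact_match)
few_shot_prefix = "\n\n".join(
    f"Question: {demo['question']}\nAnswer: {demo['answer']}" for demo in demos
)
print(few_shot_prefix)
```

Only the two runs that pass the metric end up in the prompt prefix, so failed attempts never become demonstrations.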

Let's now define our evaluation function and compare the performance of the uncompiled and compiled Baleen pipelines. While this devset does not serve as a completely reliable benchmark, it is instructive to use for this tutorial.
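A comparison of this kind boils down to averaging a metric over the devset for each version of the program. The sketch below is a plain-Python illustration: evaluate is a hypothetical helper, the two toy programs merely stand in for the uncompiled and compiled pipelines, and the resulting numbers are made-up toy values, not DSPy results.

```python
from typing import Callable, Dict, List


def evaluate(
    program: Callable[[str], str],
    devset: List[Dict[str, str]],
    metric: Callable[[Dict[str, str], str], bool],
) -> float:
    """Return the percentage of dev examples for which the metric passes."""
    if not devset:
        raise ValueError("devset must not be empty")
    passed = sum(bool(metric(ex, program(ex["question"]))) for ex in devset)
    return 100.0 * passed / len(devset)


def exact_match(example: Dict[str, str], prediction: str) -> bool:
    return example["answer"].lower() == prediction.lower()


devset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
]


def baseline_program(question: str) -> str:
    # Toy stand-in for the uncompiled pipeline: always gives the same answer.
    return "Paris"


def improved_program(question: str) -> str:
    # Toy stand-in for the compiled pipeline: answers both questions correctly.
    return "Tokyo" if "Japan" in question else "Paris"


print(evaluate(baseline_program, devset, exact_match))  # 50.0
print(evaluate(improved_program, devset, exact_match))  # 100.0
```

Running the same evaluate call on both versions keeps the comparison fair, since the devset and metric are held fixed.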



In conclusion, this article has shed light on DSPy, a cutting-edge framework dedicated to the algorithmic enhancement of language model prompts and weights. We delved into the capabilities of DSPy, highlighting its potential to revolutionize Large Language Model (LLM) applications, particularly in Retrieval-Augmented Generation (RAG) tasks.

Featuring an intuitive Pythonic syntax, DSPy's user experience parallels that of working with PyTorch. The process involves setting up a training dataset, defining a DSPy program (akin to a model), implementing bespoke validation logic, compiling the program, and engaging in a process of continuous improvement and refinement.

