  • Nagesh Singh Chauhan

Zero-Shot Learning: Can you classify an object without seeing it before?

Developing machine learning models that can make predictions on data from classes they have never seen before has become an important research area called zero-shot learning. Humans tend to be quite good at recognizing things they have never seen before, and zero-shot learning offers a possible path toward mimicking this powerful human capability.


The recent release of GPT-3 got me interested in the state of zero-shot learning and few-shot learning in NLP. While most of the zero-shot learning research is centered around Computer Vision, there has been some interesting work in the NLP domain as well.

Over the last few decades, machines have become much more intelligent, but without a properly labeled training data set of seen classes, they cannot distinguish between two similar objects. Humans, on the other hand, are capable of identifying approximately 30,000 basic object categories. In machine learning, this is known as the problem of zero-shot learning (ZSL).

Consider a child who has seen a horse before and has read somewhere that a zebra looks like a horse with black-and-white stripes: that child would have no problem recognizing a zebra.

Can you classify an object without ever seeing it?

Yes, you can if you have adequate information about its appearance, properties, and functionality. Think back to how you came to understand the world as a kid. You could spot Mars in the night sky after reading about its color and where it would be that night, or identify the constellation Cassiopeia from only being told: “it’s basically a malformed ‘W’.”

What is Zero-Shot Learning?

ZSL is a problem setup in machine learning where, at test time, a learner observes samples from classes that were not observed during training and must predict the category they belong to. Zero-shot methods work by connecting the observed (seen) and non-observed (unseen) categories through some form of auxiliary information, which encodes observable distinguishing properties of objects.

The model has no access to any data from the unseen categories during training, yet it can still be built and trained by transferring knowledge from previously seen categories and from the auxiliary information.

For example, in an animal classification use case with a set of images and auxiliary textual descriptions of what the animals look like, an Artificial Intelligence (AI) model that has been trained to identify horses but has never seen a zebra can still identify a zebra if it also knows that zebras look like striped horses. The auxiliary information may include attributes, textual descriptions, or word vectors of the category labels. This type of problem is mainly studied in computer vision (CV), natural language processing (NLP), and machine perception.

ZSL is done in two stages:

  1. Training: Where the knowledge about the attributes is captured

  2. Inference: The knowledge is then used to categorize instances among a new set of classes.

How does Zero-shot Learning work?

There are two common approaches to solving zero-shot recognition problems.

1. Embedding based approach

The main goal of this method is to map the image features and semantic attributes into a common embedding space using a projection function, which is learned using deep networks.

Let us look at an example of these methods to understand them in depth.

While training, the aim is to find a projection function from visual space to semantic space (that is, word vectors or semantic embedding) using information from seen categories. Since neural networks are used as function approximators, the projection function is learned as a deep neural network.

In the testing phase, the image features of a non-observed category are fed to the trained model, which outputs the corresponding semantic embedding. To classify, we run a nearest neighbor search in the semantic attribute space to find the class embedding closest to the output of the network. The category corresponding to that nearest semantic embedding is predicted as the category of the input image.

The figure above shows the anatomy of a typical embedding-based zero-shot learning method. The input image is first passed through a feature extractor network (a deep neural network (DNN)) to get an N-dimensional feature vector for the image. This vector is the input to the main projection network, which returns a D-dimensional output vector. The goal is to learn the weights of the projection network so that it maps the N-dimensional input to a D-dimensional output. To do this, we apply a loss that measures the compatibility between the D-dimensional output and the ground truth semantic attribute. The weights of the network are trained so that the D-dimensional output is as close as possible to the ground truth attribute.
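To make the two phases concrete, here is a minimal NumPy sketch of the embedding-based idea. Everything in it is an assumption made for illustration: the attribute vectors, the random linear "feature extractor", and a ridge-regression projection standing in for the deep projection network described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute vectors (D = 3): [striped, has_mane, is_large].
class_attributes = {
    "horse": np.array([0.0, 1.0, 1.0]),
    "tiger": np.array([1.0, 0.0, 1.0]),
    "bee":   np.array([1.0, 0.0, 0.0]),
    "zebra": np.array([1.0, 1.0, 1.0]),  # never used for training
}
seen_classes = ["horse", "tiger", "bee"]

# Toy stand-in for a feature extractor: N = 8 dimensional features that
# depend linearly on the attributes, plus a little noise.
N, D = 8, 3
mixing = rng.normal(size=(D, N))

def fake_image_features(name, count):
    clean = class_attributes[name] @ mixing
    return clean + 0.05 * rng.normal(size=(count, N))

# Training: fit a linear projection W (N x D) on seen classes only.
X = np.vstack([fake_image_features(c, 30) for c in seen_classes])
A = np.vstack([np.tile(class_attributes[c], (30, 1)) for c in seen_classes])
ridge = 1e-3
W = np.linalg.solve(X.T @ X + ridge * np.eye(N), X.T @ A)

def predict(features):
    """Project features into attribute space, then pick the nearest class."""
    projected = features @ W  # shape (count, D)
    names = list(class_attributes)
    prototypes = np.stack([class_attributes[n] for n in names])
    distances = np.linalg.norm(projected[:, None, :] - prototypes[None], axis=2)
    return [names[i] for i in distances.argmin(axis=1)]

# Testing: zebra features were never seen, yet they land near the zebra attributes.
zebra_predictions = predict(fake_image_features("zebra", 10))
print(zebra_predictions)
```

The toy works because the zebra attributes are a combination of attributes that the seen classes already cover; real image features are far less well behaved, which is where the bias and domain shift issues come from.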

2. Generative model-based approach

The main drawback of embedding-based methods is that they suffer from bias and domain shift. Since the projection function is learned using only seen classes during training, it will be biased towards predicting seen category labels. There is also no guarantee that the trained projection function will correctly map non-observed category image features to the corresponding semantic space at test time, because the deep network has only learned to map seen category image features to the semantic space and might not be able to do the same for novel, non-observed categories.

To overcome this drawback, it is important that our zero-shot classification model is trained on both seen and non-observed category images at train time. This is where generative model-based methods come in.

The generative method’s goal is to generate image features for non-observed categories using semantic attributes. Generally, this is done using a conditional generative adversarial network (cGAN) that generates image features conditioned on the semantic attribute of a given category.

The figure below depicts a general generative model-based zero-shot learning method. As in the embedding-based method, we use a feature extractor network to get an N-dimensional feature vector. The attribute vector is fed to the generative model as displayed in the diagram, and the generator produces an N-dimensional output vector conditioned on that attribute vector. The generative model is trained so that the synthesized feature vector is indistinguishable from the original N-dimensional feature vector.

Once the generative model is trained, we freeze the weights of the generator and pass it the class attributes of the non-observed categories to generate image features for them. Once we have seen class image features (from the training dataset) and generated non-observed category image features, we can train a basic classifier that takes image features as input and outputs the respective category label, as shown in the figure.
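The three steps above (fit a generator on seen classes, synthesize features for non-observed classes, train a classifier on both) can be sketched in a few lines of NumPy. Note the swap: instead of a cGAN, this sketch uses a least-squares map from attributes to features plus Gaussian noise as a deliberately simple stand-in generator, and a nearest-centroid classifier as the final classifier. The attribute vectors and the toy feature extractor are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute vectors: [striped, has_mane, is_large].
class_attributes = {
    "horse": np.array([0.0, 1.0, 1.0]),
    "tiger": np.array([1.0, 0.0, 1.0]),
    "bee":   np.array([1.0, 0.0, 0.0]),
    "zebra": np.array([1.0, 1.0, 1.0]),  # unseen: no real features at train time
}
seen_classes, unseen_classes = ["horse", "tiger", "bee"], ["zebra"]

N = 8
mixing = rng.normal(size=(3, N))  # toy stand-in for a feature extractor

def real_features(name, count):
    return class_attributes[name] @ mixing + 0.05 * rng.normal(size=(count, N))

# Step 1: fit the "generator" on seen classes (attributes -> features).
# A least-squares map plus Gaussian noise replaces the cGAN here.
seen_features = {c: real_features(c, 30) for c in seen_classes}
seen_X = np.vstack([seen_features[c] for c in seen_classes])
seen_A = np.vstack([np.tile(class_attributes[c], (30, 1)) for c in seen_classes])
G, *_ = np.linalg.lstsq(seen_A, seen_X, rcond=None)  # shape (3, N)
noise_scale = np.std(seen_X - seen_A @ G)

def generate(name, count):
    return class_attributes[name] @ G + noise_scale * rng.normal(size=(count, N))

# Step 2: freeze the generator and synthesize features for unseen classes.
train_features = dict(seen_features)
train_features.update({c: generate(c, 30) for c in unseen_classes})

# Step 3: train an ordinary classifier on real + synthetic features
# (nearest class centroid, chosen only to keep the sketch short).
centroids = {c: x.mean(axis=0) for c, x in train_features.items()}

def classify(features):
    names = list(centroids)
    centers = np.stack([centroids[n] for n in names])
    distances = np.linalg.norm(features[:, None, :] - centers[None], axis=2)
    return [names[i] for i in distances.argmin(axis=1)]

print(classify(real_features("zebra", 5)))
print(classify(real_features("horse", 5)))
```

Because the final classifier sees (synthetic) zebra features during training, zebra becomes an ordinary class at test time, which is how this family of methods reduces the bias towards seen classes.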

Evaluation Metric for Zero-shot learning methods

Generally, image recognition and image classification models use Top-1 accuracy as their evaluation metric. However, the evaluation metric used for zero-shot recognition models is different from the one used for vanilla image classification models.

We use the average per category Top-1 accuracy to evaluate zero-shot recognition results.

Mathematically, for a set of classes Y with N classes, the average per-class Top-1 accuracy is given by:
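In the notation commonly used for this metric (the symbol acc_Y is a notational assumption), the definition reads:

```latex
\mathrm{acc}_{\mathcal{Y}} = \frac{1}{N} \sum_{c=1}^{N}
  \frac{\#\,\text{correct predictions in class } c}{\#\,\text{samples in class } c}
```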

That is, we calculate the accuracy for each class individually and then average it across all classes. This encourages high performance on both sparsely and densely populated classes. In the generalized zero-shot setting, the goal is high accuracy on both seen classes and non-observed categories, so the performance metric is defined as the harmonic mean of the performance on seen classes and on non-observed categories.
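Both metrics are easy to compute by hand. The sketch below is one plain-Python way to do it; the function names and the tiny label lists are made up for illustration.

```python
from collections import defaultdict

def per_class_top1_accuracy(y_true, y_pred):
    """Average of the per-class accuracies (each class counts equally)."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)

def harmonic_mean(acc_seen, acc_unseen):
    """Generalized zero-shot score from seen and unseen class accuracies."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Made-up labels: four horses (three predicted correctly) and one zebra (correct).
y_true = ["horse", "horse", "horse", "horse", "zebra"]
y_pred = ["horse", "horse", "horse", "zebra", "zebra"]

print(per_class_top1_accuracy(y_true, y_pred))  # 0.875, plain accuracy would be 0.8
print(harmonic_mean(0.75, 1.0))
```

The harmonic mean is used rather than the arithmetic mean because it stays low unless both accuracies are high, so a model cannot score well by ignoring the non-observed categories.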


We will use Hugging Face to implement zero-shot learning, as it is easy to use and supports numerous NLP tasks. The Hugging Face transformers library provides a zero-shot-classification pipeline that can perform text classification, sentiment classification, and topic modeling without any labeled data or training.

Key Steps:

  1. First, we need to install and import the pipeline.

  2. Next, define the input text sequence and candidate labels.

  3. Finally, we run the classifier.

For more details, check here.

Zero-shot Text Classifier

In zero-shot text classification, an already trained model can classify any given text without having been trained on the specific candidate labels.

We can install the Hugging Face transformers library with the following command:

pip install transformers

We can then import the pipeline and create a zero-shot-classification classifier:

from transformers import pipeline
classifier = pipeline("zero-shot-classification")

There are two ways to use the zero-shot classifier:

  1. We can pass the input as a sequence along with candidate_tags. The pipeline then returns scores that behave like the output of a softmax activation function: the scores of all candidate labels add up to 1, so they depend on one another.

sequence = "How is the weather today?"
candidate_tags = ["climate", "environment", "economics"]
classifier(sequence, candidate_tags)


{'labels': ['environment', 'climate', 'economics'], 'scores': [0.5216448307037354, 0.46808499097824097, 0.010270169004797935], 'sequence': 'How is the weather today?'}

  2. For the multi-label case, we can additionally pass multi_class=True (recent releases of transformers name this argument multi_label). The scores are then independent: each label receives its own probability between 0 and 1, and the scores no longer need to add up to 1.

sequence = "How is the weather today?"
candidate_tags = ["climate", "environment", "economics", "elections"]
classifier(sequence, candidate_tags, multi_class=True)


{'labels': ['climate', 'environment', 'economics', 'elections'], 'scores': [0.9449107646942139, 0.9024710655212402, 0.00020360561029519886, 0.00011781518696807325], 'sequence': 'How is the weather today?'}

How does text classification work?

The zero-shot text classification model is trained on Natural Language Inference (NLI). Text classification is the process of assigning text to one or more categories. In NLI, a model receives two sequences, a premise and a hypothesis, and predicts whether the premise entails the hypothesis. For zero-shot classification, the input sequence is passed to the NLI model as the premise, and each candidate label is turned into a hypothesis. The model then checks whether the premise entails each hypothesis, and the labels are ranked by how strongly they are entailed.
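The sketch below shows how NLI outputs can be turned into the two kinds of scores seen earlier. It does not call a model: the logits are made-up numbers, and the column order [contradiction, neutral, entailment] and the hypothesis template "This example is {}." are assumptions based on my understanding of the pipeline's default behavior with an MNLI-trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

sequence = "How is the weather today?"  # the premise
labels = ["climate", "environment", "economics"]

# Each candidate label becomes a hypothesis paired with the same premise.
hypotheses = [f"This example is {label}." for label in labels]

# Made-up NLI logits, one row per (premise, hypothesis) pair,
# columns assumed to be [contradiction, neutral, entailment].
logits = np.array([
    [-1.0, 0.2, 2.5],
    [-0.8, 0.1, 2.6],
    [3.0, 0.5, -2.0],
])

# Single-label mode: softmax over the entailment logits of all labels,
# so the scores compete with each other and add up to 1.
single_label_scores = softmax(logits[:, 2])

# Multi-label mode: for each label separately, softmax over
# [contradiction, entailment], so every score is independent.
multi_label_scores = softmax(logits[:, [0, 2]], axis=1)[:, 1]

for label, single, multi in zip(labels, single_label_scores, multi_label_scores):
    print(f"{label:12s} single={single:.3f} multi={multi:.3f}")
```

This mirrors the pattern in the outputs above: in single-label mode two plausible labels split the probability mass, while in multi-label mode both can score close to 1.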


Zero-shot learning is an important method. It is a comparatively new topic, and there is still a lot of ongoing research across many domains. Unlike standard machine learning, where classifiers are expected to correctly assign new samples to classes they have already seen labeled during training, in zero-shot learning no samples from the target classes are available while training the model.

Code is available here.

