By Nagesh Singh Chauhan

Natural language processing in Apache Spark using NLTK (part 1/2)

In its most basic form, natural language processing is a field of artificial intelligence that explores computational methods for interpreting and processing natural language, in either textual or spoken form.



In this two-part series I'll be discussing natural language processing, NLTK in Spark, environment setup, and some basic implementations in the first post, and in the second, how we can build an NLP application that leverages the benefits of big data.


What is Natural Language?


A Natural language or Ordinary language is any language that has evolved naturally with time in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech, signing, or text.


Think about how much text you see each day:


Signs, menus, email, SMS, web pages, and so much more; the list is endless. Now think about speech. We may speak to each other, as a species, more than we write. It may even be easier to learn to speak than to write.


Given the importance of this type of data, we must have methods to understand and reason about natural language, just like we do for other types of data.


Human language is highly ambiguous; it is also ever-changing and evolving. People are great at producing and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are very poor at formally understanding and describing the rules that govern it.



What is Natural Language Processing?


Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). The development of NLP applications is challenging because computers traditionally require humans to “speak” to them in a programming language that is precise, unambiguous, and highly structured, or through a limited number of clearly enunciated voice commands. Human speech, however, is not always precise — it is often ambiguous and the linguistic structure can depend on many complex variables, including slang, regional dialects, and social context.


The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.


How does natural language processing work?


Current approaches to NLP are based on deep learning, a type of AI that inspects and uses patterns in the data to improve a program’s understanding. Deep learning models require massive amounts of labeled data to train on and identify relevant correlations, and assembling this kind of big data set is one of the main hurdles to NLP currently.


What are the common real-world NLP Implementations?


Some successful implementations of natural language processing (NLP) include search engines like Google and Yahoo.


Google's search engine learns, for example, that you are interested in technology, so it tailors your results accordingly.

Social media feeds like your Facebook News Feed. The feed algorithm uses natural language processing to understand your interests and is more likely to show you related posts and ads than other content.

Speech assistants like Apple's Siri.

Spam filters like Google's. It is no longer just simple keyword filtering; modern spam filters analyze the content of an email to decide whether or not it is spam.

To implement NLP, we have some useful tools available, such as:

  1. CoreNLP from Stanford group

  2. NLTK, the most widely-mentioned NLP library for Python

  3. TextBlob, a user-friendly and intuitive NLTK interface

  4. Gensim, a library for document similarity analysis

  5. SpaCy, an industrial-strength NLP library built for performance



In this blog, I’m going to use NLTK for natural language processing.


Natural Language Toolkit (NLTK)

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
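
As a quick standalone illustration (a minimal sketch of my own, run locally outside Spark; the sample sentence and the two downloads are just examples), NLTK can tokenize and tag a sentence in a few lines:

# minimal local NLTK example (Python 2, to match the environment used below)
import nltk

# one-time downloads of the tokenizer and POS-tagger models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "Apache Spark makes it easy to process text at scale."
tokens = nltk.word_tokenize(sentence)
print tokens                 # e.g. ['apache', 'spark', ...] style word tokens
print nltk.pos_tag(tokens)   # e.g. [('Apache', 'NNP'), ('Spark', 'NNP'), ...]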


Let us get our hands dirty and try to implement some cool NLP stuff. Below are the steps that I will follow throughout this post.

  1. NLTK environment setup and Installation in Apache Spark

  2. Word tokenize

  3. Remove Stopwords

  4. Remove punctuations

  5. Part of speech tagging

  6. Named Entity Recognition

  7. Lemmatization

  8. Text Classification

So let us start with our implementation:

1. NLTK environment setup and installation in Apache Spark:

Download the Miniconda installer (choose the version for Python 2.7 or for Python 3.5) and run it:

$ bash Miniconda...sh

This installs Miniconda under your user's home directory (the installer sets the prefix). After installation, accept the change to the .bashrc file, log out, and log back in.


Create a CONDA environment:

source ~/.bashrc
conda create -n nltk_env --copy -y -q python=2 nltk numpy pip

Activate the nltk_env and execute these commands:

$ source activate nltk_env
(nltk_env)$ pip install any_python_package
(nltk_env)$ cp -r /usr/local/lib ~/miniconda2/envs/nltk_env/
(nltk_env)$ python -m nltk.downloader -d nltk_data all
(nltk_env)$ hdfs dfs -put nltk_data/corpora/state_union/1970-Nixon.txt ./
$ cd ~/miniconda2/envs/
$ zip -r nltk_env.zip nltk_env

# archive the NLTK data for distribution; check where your nltk_data folder is located
(nltk_env)$ cd /nltk_data/tokenizers/
(nltk_env)$ zip -r ../../tokenizers.zip *
(nltk_env)$ cd /nltk_data/taggers/
(nltk_env)$ zip -r ../../taggers.zip *
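
Before moving on, it may be worth a quick sanity check inside the activated nltk_env (this check is my own addition, not part of the original setup):

# run with the nltk_env python to confirm the packages are importable
import nltk
import numpy

print nltk.__version__
print numpy.__version__
print nltk.data.path   # directories where NLTK will look for corpora and models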

Next, submit your Spark job with the properties below (the same settings can also be added to spark-defaults.conf). The archive.zip#alias notation ships each zip to the executors and exposes it under a directory named after the alias, which is why PYSPARK_PYTHON points into ./NLTK/nltk_env:

PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python \
--conf spark.yarn.appMasterEnv.NLTK_DATA=./ \
--master yarn-cluster \
--archives nltk_env.zip#NLTK,tokenizers.zip#tokenizers,taggers.zip#taggers \

We have now installed NLTK in the virtual environment 'nltk_env'. Please let me know in the comments section if you face any issues during the installation process.


Now we are all set to start working with NLTK in Spark. First, we will create a SparkContext. Note that Anaconda for cluster management does not create a SparkContext by default, and we will use YARN as the resource manager.


from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-nltk')
sc = SparkContext(conf=conf)

Load the data that we have already put in HDFS. The data file is one of the example documents provided by NLTK.

data = sc.textFile('hdfs:///user/spark/warehouse/1970-Nixon.txt')

Let's check how the data looks at this point; a quick way to peek at it is shown below. Since sc.textFile() splits the file line by line, the text already comes in roughly sentence-sized chunks, so in the next step we will tokenize it into words.
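
A minimal sketch for peeking at the loaded RDD (take(5) is an arbitrary choice):

# print the first five lines of the loaded RDD
for line in data.take(5):
    print line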

This is our data when no processing has been done.

2. Word tokenization.


Text data can be split into words using the method word_tokenize().

# word tokenizer: lower-case each line and split it into word tokens
def word_tokenize1(x):
    import nltk
    lower_x = x.lower()
    return nltk.word_tokenize(lower_x)

words = data.flatMap(word_tokenize1)
print words.collect()

word tokenize

3. Removing Stop words.


A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# keep only tokens that are not stop words and not empty strings
stopW = words.filter(lambda word: word not in stop_words and word != '')
print stopW.collect()

After removing the stop words, our data looks like this:



Our data after removal of stop words


4. Remove punctuation from our data.

import string

# filter out tokens that are just punctuation characters
list_punct = list(string.punctuation)
filtered_data = stopW.filter(lambda token: token not in list_punct)
print filtered_data.collect()

You can also remove punctuation with a regular expression, or by checking tokens against an explicit string of punctuation characters:

punct_chars = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''  # characters to treat as punctuation
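
For example, a regex-based variant could look like the sketch below (my own illustration, not the approach used above; it strips punctuation characters from each token and drops tokens that end up empty):

import re

# remove punctuation characters from each token, then drop empty tokens
filtered_data_re = stopW.map(lambda token: re.sub(r'[^\w\s]', '', token)) \
                        .filter(lambda token: token != '')
print filtered_data_re.collect()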

After removing punctuation, our data looks like this:



Data after removal of punctuation marks

5. Part of speech tagging.


A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads the text in some language and assigns parts of speech to each word (and other tokens), such as noun, verb, adjective, etc.

# tag each token with its part of speech
def pos_tag(x):
    import nltk
    return nltk.pos_tag([x])

pos_word = filtered_data.map(pos_tag)
print pos_word.collect()


Part of speech tagging

6. Named entity recognition.


Named Entity Recognition is probably the first step towards information extraction from unstructured text. It means identifying the real-world entities mentioned in the text (Person, Organization, Event, etc.).

# named entity recognition: POS-tag each token, then chunk named entities
def named_entity_recog(x):
    import nltk
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    return nltk.ne_chunk(nltk.pos_tag([x]))

NER_word = filtered_data.map(named_entity_recog)
print NER_word.collect()

Named Entity Recognition

7. Lemmatization.


Stemming and lemmatization are basic text processing methods for English text. The goal of both is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. I have skipped stemming here because it can be crude: it sometimes produces stems that are not valid words at all.
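
To see the difference described above, here is a small local comparison (my own sketch; the sample words are arbitrary) of NLTK's PorterStemmer and WordNetLemmatizer:

import nltk
nltk.download('wordnet')

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'cries', 'feet']:
    # the stemmer chops suffixes and can produce non-words (e.g. 'studi'),
    # while the lemmatizer maps to a valid base form (e.g. 'study')
    print word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word)

With that in mind, the distributed lemmatization step looks like this: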

# lemmatization: reduce each token to its WordNet base form
def lemma(x):
    import nltk
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return lemmatizer.lemmatize(x)

lem_words = filtered_data.map(lemma)
print lem_words.collect()

Lemmatization of words

8. Text Classification


Here we are going to find the words with the highest frequency and sort them in decreasing order of frequency.

# build (word, frequency) pairs and sort them in decreasing order of frequency
def word_freq(x):
    import nltk
    return nltk.FreqDist(x.split(",")).most_common()

text_Classifi = filtered_data.flatMap(word_freq).reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1], ascending=False)

# take the first 100 most common words
topcommon_data = text_Classifi.take(100)
print topcommon_data

Classification of Text.


You can also use this to find the number of occurrences of a particular word, among other similar operations, as shown below.
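
For instance, counting the occurrences of a single word with the same filtered RDD could look like this (a small sketch; the word 'america' is just an example):

# count how many times one particular word appears among the filtered tokens
target = 'america'   # tokens were lower-cased during tokenization
count = filtered_data.filter(lambda token: token == target).count()
print target, 'appears', count, 'times'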


I'll stop here; in part 2 of Natural Language Processing in Spark using NLTK, we will build an NLP application.


Hope this helps !!! Thanks for reading.


Happy Learning :)
