Sample Text Data Preprocessing Implementation In SparkNLP

In this article, I will examine some basic data pre-processing steps and implement some of them in SparkNLP.

Ahmet Emin Tek
Nov 4, 2020

First, if we need to define data pre-processing, we can put it simply like this: data pre-processing is transforming raw data into the desired format. We apply certain pre-processing steps to the data according to our tasks and aims, and these steps vary case by case (images, text, voice, etc.).

In any NLP task, we may have dirty data, meaning the data can contain many irrelevant, missing, or undesirable parts. To get rid of these problems, we apply data cleaning methods. We may also have noisy data, which machines cannot interpret. These and similar issues are the reasons for applying pre-processing techniques.

Let’s apply some preprocessing steps to sample data by using SparkNLP.
In this article, I will use the SMS Spam Collection data as the sample.

In this study, I will create all the annotators one by one and then collect them into a single pipeline to complete the pre-processing. Here are the annotators I will create:
->DocumentAssembler
->Tokenizer
->SentenceDetector
->Normalizer
->StopWordsCleaner
->TokenAssembler
->Stemmer
->Lemmatizer

Import SparkNLP and the necessary libraries, read the data from local storage, and convert it into a Spark dataframe like the following:

import sparknlp
spark= sparknlp.start()
from sparknlp.base import *
from sparknlp.annotator import *
df= spark.read\
.option("header", True)\
.csv("spam_text_messages.csv")\
.toDF("category", "text")
df.show(5, truncate=30)
>>>+--------+------------------------------+
|category| text|
+--------+------------------------------+
| ham|Go until jurong point, craz...|
| ham| Ok lar... Joking wif u oni...|
| spam|Free entry in 2 a wkly comp...|
| ham|U dun say so early hor... U...|
| ham|Nah I don't think he goes t...|
+--------+------------------------------+
only showing top 5 rows

The annotators and transformers come from the base and annotator modules. I won’t go into the details of what annotators and transformers are in this article.

We have two columns named category and text. The text column consists of the messages, and the category column holds the type of each message: spam or not (ham).

1- Document Assembler

DocumentAssembler is the beginning part of any SparkNLP project. It creates the first annotation of type Document, which may be used by annotators down the road. DocumentAssembler() comes from SparkNLP’s base module. We can use it like the following:

documentAssembler= DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
.setCleanupMode("shrink")

Parameters:
setInputCol() -> The name of the column that will be converted. We can specify only one column here. The significant point is that it can read either a String column or an Array[String].
setOutputCol() -> (optional) The name of the generated column of Document type. We can specify only one column here. Default is ‘document’.
setCleanupMode() -> (optional) Cleaning up options.

I chose shrink as the cleanup mode; it removes new lines and tabs, and merges multiple spaces and blank lines into a single space.
Now, we will transform the df with documentAssembler by using the transform() function and then print the schema as follows:

df_doc= documentAssembler.transform(df)
df_doc.printSchema()
>>>root
|-- category: string (nullable = true)
|-- text: string (nullable = true)
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)

The annotators and transformers all come with universal metadata in SparkNLP. We can access each of the fields shown above with {column name}.{field name}.

In order to view the text line by line, along with the beginning and ending positions of each line:

df_doc.select("document.result", "document.begin", "document.end").show(5, truncate=30)>>>+------------------------------+-----+-----+
| result|begin| end|
+------------------------------+-----+-----+
|[Go until jurong point, cra...| [0]|[110]|
|[Ok lar... Joking wif u oni...| [0]| [28]|
|[Free entry in 2 a wkly com...| [0]|[154]|
|[U dun say so early hor... ...| [0]| [48]|
|[Nah I don't think he goes ...| [0]| [60]|
+------------------------------+-----+-----+
only showing top 5 rows

We can print out the first item’s result:

df_doc.select("document.result").take(1)>>>[Row(result=['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'])]

2- Tokenizer

Tokenizer() is used to identify the tokens in SparkNLP. Here is how to form the tokenizer:

tokenizer= Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")

The tokenizer also provides many parameters to make our task more convenient. For example:

setExceptions(StringArray): Useful if you have a list of composite words that you don’t want to split.
setContextChars(StringArray): Useful if you don’t want to split on certain characters such as parentheses, question marks, etc. It needs a string array.
setTargetPattern(): Useful if you want to identify a candidate for tokenization with basic regex rules. Defaults to \S+, which means anything except a space.

These are the general parameters; there are many more to use case by case.
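For instance, here is a minimal sketch using two of them (the exception words and context characters below are illustrative values, not part of our pipeline):

# Keep the listed composite words as single tokens, and treat only the
# listed characters as context characters around tokens.
customTokenizer= Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")\
.setExceptions(["New York", "e-mail"])\
.setContextChars(["(", ")", "?", "!", "."])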

3- Sentence Detector

SentenceDetector() finds sentence bounds in raw text. Here is how I formed it:

sentenceDetector= SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")

A useful parameter here:
setCustomBounds(string): Lets you separate sentences by custom characters.
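Here is a minimal sketch (the bound characters are illustrative values, not used in our pipeline), which would additionally split sentences on newlines and semicolons:

# Split sentences on newlines and semicolons in addition to the defaults.
customSentenceDetector= SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")\
.setCustomBounds(["\n", ";"])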

4- Normalizer

Normalizer() cleans dirty characters following a regex pattern and removes words based on a given dictionary. The implementation is like the following:

normalizer= Normalizer()\
.setInputCols(["token"])\
.setOutputCol("normalized")\
.setLowercase(True)\
.setCleanupPatterns(["[^\w\d\s]"])

Now, let’s explain the parameters.

.setLowercase(True) : Lowercases the tokens; the default is False.
.setCleanupPatterns(["[^\w\d\s]"]) : Takes a list of regular expressions for the normalization; the default is [^A-Za-z]. With the pattern above, it removes punctuation and keeps alphanumeric characters and whitespace.

5- Stopwords Cleaner

StopWordsCleaner() is used to drop the stopwords from the text. Here is the implementation:

stopwordsCleaner =StopWordsCleaner()\
.setInputCols(["token"])\
.setOutputCol("cleaned_tokens")\
.setCaseSensitive(True)

Explanation of the parameters:
.setCaseSensitive(True) : Whether to do a case-sensitive comparison over the stop words.
.setStopWords() : Lets you supply the words to be filtered out. It needs an Array[String].
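Here is a minimal sketch of a cleaner with its own stop word list (the words below are illustrative values, not the list used in our pipeline):

# Drop only the listed words, ignoring case.
customStopwordsCleaner= StopWordsCleaner()\
.setInputCols(["token"])\
.setOutputCol("cleaned_tokens")\
.setStopWords(["the", "a", "an", "in", "on"])\
.setCaseSensitive(False)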

6- Token Assembler

TokenAssembler() is used to assemble the cleaned tokens back together again. The implementation is like the following:

tokenAssembler= TokenAssembler()\
.setInputCols(["sentence", "cleaned_tokens"])\
.setOutputCol("assembled")

7- Stemmer

The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Here is the implementation:

stemmer= Stemmer()\
.setInputCols(["token"])\
.setOutputCol("stem")

8- Lemmatizer

Both stemming and lemmatization reduce a given word to its root. However, there are some differences between them. In stemming, the algorithm doesn’t actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithm does have this knowledge: it refers to a dictionary to understand the meaning of the word before reducing it.
For example, a lemmatization algorithm knows that the word “went” is derived from the word “go”, and hence the lemma will be “go”. But a stemming algorithm wouldn’t be able to do the same. There could be over-stemming or under-stemming, and the word “went” could be reduced to “wen” or kept as “went”.
So, here is the implementation of Lemmatizer() :

We will first pull the lemmatization dictionary from the link, then implement the lemmatization.

lemmatizer= Lemmatizer()\
.setInputCols(["token"])\
.setOutputCol("lemma")\
.setDictionary("AntBNC_lemmas_ver_001.txt",
value_delimiter="\t", key_delimiter="->")

We selected the dictionary with the .setDictionary() parameter.

Putting All Processes Into a Spark ML Pipeline

Fortunately, all the SparkNLP annotators and transformers can be used within Spark ML Pipelines. We have just created the annotators and transformers that we need. Now we will put them into a pipeline, fit the pipeline, and transform our dataset with it. Let’s start!

Importing the Spark ML Pipeline:
from pyspark.ml import Pipeline

Putting annotators and transformers into the pipeline:

nlpPipeline= Pipeline(stages=[
documentAssembler,
tokenizer,
sentenceDetector,
normalizer,
stopwordsCleaner,
tokenAssembler,
stemmer,
lemmatizer
])

Now, in order to use the pipeline with different dataframes, we will create an empty dataframe to fit the pipeline on.

empty_df= spark.createDataFrame([[""]]).toDF("text")
model= nlpPipeline.fit(empty_df)

Nice, we have just built the model. It’s time to apply it to our dataset. To do this, we will use the transform() function like the following:

result= model.transform(df)

Examining The Results

Well, let’s examine the results of what we did. But first, we will import the functions library from pyspark.sql in order to do some processing:
from pyspark.sql import functions as F

Let’s begin the examination with tokens and normalized tokens.

result.select("token.result" ,"normalized.result")\
.show(5, truncate=30)
>>>+------------------------------+------------------------------+
| result| result|
+------------------------------+------------------------------+
|[Go, until, jurong, point, ...|[go, until, jurong, point, ...|
|[Ok, lar, ..., Joking, wif,...|[ok, lar, joking, wif, u, oni]|
|[Free, entry, in, 2, a, wkl...|[free, entry, in, 2, a, wkl...|
|[U, dun, say, so, early, ho...|[u, dun, say, so, early, ho...|
|[Nah, I, don't, think, he, ...|[nah, ı, dont, think, he, g...|
+------------------------------+------------------------------+
only showing top 5 rows

The tokens shown in the first column have been normalized in the second column. Let’s check the data cleaned of stopwords, like the following:

result.select(F.explode(F.arrays_zip("token.result",
"cleaned_tokens.result")).alias("col"))\
.select(F.expr("col['0']").alias("token"),
F.expr("col['1']").alias("cleaned_sw")).show(10)
>>>+---------+----------+
| token|cleaned_sw|
+---------+----------+
| Go| Go|
| until| jurong|
| jurong| point|
| point| ,|
| ,| crazy|
| crazy| ..|
| ..| Available|
|Available| bugis|
| only| n|
| in| great|
+---------+----------+
only showing top 10 rows

As you see above, some stopwords were dropped by the stopwords cleaner.

The token assembler puts the cleaned tokens back together into sentences. Now, let’s compare the sentence detector result with the token assembler result.

result.select(F.explode(F.arrays_zip("sentence.result",
"assembled.result")).alias("col"))\
.select(F.expr("col['0']").alias("sentence"),
F.expr("col['1']").alias("assembled")).show(5,
truncate=30)
>>>+------------------------------+------------------------------+
| sentence| assembled|
+------------------------------+------------------------------+
| Go until jurong point, crazy.| Go jurong point, crazy|
| .| |
|Available only in bugis n g...|Available bugis n great wor...|
| Cine there got amore wat.| Cine got amore wat|
| .| |
+------------------------------+------------------------------+
only showing top 5 rows

Also, we can check which sentence each assembled piece belongs to, like the following:

result.withColumn("tmp", F.explode("assembled"))\
.select("tmp.*").select("begin", "end", "result",
"metadata.sentence").show(5,
truncate=30)
>>>+-----+---+------------------------------+--------+
|begin|end| result|sentence|
+-----+---+------------------------------+--------+
| 0| 21| Go jurong point, crazy| 0|
| 29| 28| | 1|
| 31| 74|Available bugis n great wor...| 2|
| 84|101| Cine got amore wat| 3|
| 109|108| | 4|
+-----+---+------------------------------+--------+
only showing top 5 rows

Seems nice! Now we will compare the tokens, stems, and lemmas. In this part, we will also see how to convert a Spark dataframe into a pandas dataframe easily.
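Here is a minimal sketch of that comparison, using the same arrays_zip and col indexing pattern as the snippets above, and toPandas() to convert the result into a pandas dataframe (the variable name comparison is just for illustration):

# Zip tokens, stems and lemmas positionally, explode into rows,
# then convert the Spark dataframe into a pandas dataframe.
comparison= result.select(F.explode(F.arrays_zip("token.result",
"stem.result", "lemma.result")).alias("col"))\
.select(F.expr("col['0']").alias("token"),
F.expr("col['1']").alias("stem"),
F.expr("col['2']").alias("lemma"))
comparison.toPandas().head(10)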

You can compare the stems and lemmas nicely. For the word “available”, the stemmer changed it to “avail”, because the stemming algorithm doesn’t know what the words mean.

Well, we’ve seen some basic preprocessing steps throughout this article.
I suggest you visit JohnSnowLabs, the official developer of SparkNLP, in order to access more info and details. There is also a great introduction as a Colab notebook.

Thanks for reading and for your support!
