Data Preprocessing | Natural Language Processing

Basil K Jose
May 4, 2021 · 4 min read


A hands-on guide to the data preprocessing steps to follow when dealing with text data for NLP modeling.

Overview

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

Data preprocessing is an important step in getting data into a form that a machine learning model can understand. It involves many steps, such as data cleaning, data transformation, and feature selection. Data cleaning and transformation are methods used to remove outliers and standardize the data so that it takes a form that can easily be used to build a model.

Why Do You Need Data Preprocessing?

By now, you’ve surely realized why data preprocessing is so important. Since mistakes, redundancies, missing values, and inconsistencies all compromise the integrity of the dataset, you need to fix all of those issues for a more accurate outcome. Imagine training a machine learning algorithm to handle your customers’ purchases on a faulty dataset: chances are the system will develop biases and deviations that produce a poor user experience.

Thus, before using that data for the purpose you want, you need it to be as organized and “clean” as possible. There are several ways to do so, depending on what kind of problem you’re tackling. Ideally, you’d use all of the following techniques to get a better data set.

Steps For Data Preprocessing

In this section, we will code common steps involved in text preprocessing.
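The snippets below come from a longer notebook, so they are fragments rather than a complete script. A minimal set of imports they assume (inferred from the code that follows, not part of the original):

import re

import nltk
from bs4 import BeautifulSoup
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm import tqdm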

1) Lower Case

Converting the text into lowercase letters.

sent_0 = sent_0.lower()

2) Removing all URLs from data

We remove URLs (anything starting with “http”) from the text using a Python regular expression.

sent_0 = re.sub(r"http\S+", "", sent_0)  # drop everything from "http" up to the next whitespace

3) Removing all tags from data

For removing HTML tags we use the Beautiful Soup Python package.

soup = BeautifulSoup(sent_0, 'lxml')  # parse the markup
text = soup.get_text()                # e.g. '<br />Great movie!' -> 'Great movie!'

4) Removing words with numbers from data

We remove words containing numbers from the text using a Python regular expression.

sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()  # drop any token that contains a digit

5) Removing special characters from data

We remove all the special characters (*, &, #, @, …) from the text. Note that the regex below also strips apostrophes, so in practice the decontraction step (step 6) should be applied before this one.

sent_0 = re.sub('[^A-Za-z0-9]+', ' ', sent_0)  # collapse any run of non-alphanumeric characters into a single space

6) Decontracting the words

In this step, we expand English-language contractions using Python.

def decontracted(phrase):
    # specific contractions first, since the general rules below would mangle them
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general contractions (note: \'s is ambiguous between possessive and "is")
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
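A quick check of the function on a sample sentence:

print(decontracted("I can't believe it won't rain"))
# I can not believe it will not rain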

7) Removing stop words

Stop words are words that are filtered out before or after processing natural language data. Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.

nltk.download('stopwords')
eng_stopwords = set(stopwords.words('english'))  # NLTK's built-in English stop word list
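The snippet above only builds the stop word set; one minimal way to actually filter a sentence with it:

sent_0 = ' '.join(w for w in sent_0.split() if w not in eng_stopwords)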

8) Stemming

Stemming is the process of reducing an inflected word to its word stem by chopping off prefixes and suffixes; unlike a lemma, the resulting stem need not be a valid word on its own. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

def stemmer(data):
    review_clean_ps = []
    ps = PorterStemmer()
    for sentence in tqdm(data['review'].values):
        ps_stems = []
        for w in sentence.split():
            if w == 'oed':  # 'oed' trips up PorterStemmer in some NLTK versions, so skip it
                continue
            ps_stems.append(ps.stem(w))
        review_clean_ps.append(' '.join(ps_stems))
    data['review'] = review_clean_ps
    return data
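A quick sanity check on a toy DataFrame (assuming pandas, imported as pd):

import pandas as pd

df = pd.DataFrame({'review': ['amazing movies', 'convincing acting']})
print(stemmer(df)['review'].tolist())
# ['amaz movi', 'convinc act']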

9) Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meaning to one word.

def get_wordnet_pos(treebank_tag):
    # map Penn Treebank POS tags to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def lemmatization(data):
    review_clean_wnl = []
    wnl = WordNetLemmatizer()
    for sentence in tqdm(data['review'].values):
        wnl_stems = []
        token_tag = pos_tag(sentence.split())  # POS-tag first so the lemmatizer gets context
        for word, tag in token_tag:
            wnl_stems.append(wnl.lemmatize(word, pos=get_wordnet_pos(tag)))
        review_clean_wnl.append(' '.join(wnl_stems))
    data['review'] = review_clean_wnl
    return data
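And the same kind of check for the lemmatizer (continuing the toy example above; the exact output depends on the POS tagger):

df = pd.DataFrame({'review': ['the children were running']})
print(lemmatization(df)['review'][0])
# the child be run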

10) Vectorization

Word embedding, or word vectorization, is a methodology in NLP for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used for word prediction and for measuring word similarity/semantics.

There are different types of vectorization techniques; here we mainly look at bag of words and the TF-IDF vectorizer.

# bag of words: unigrams up to 4-grams, keeping at most 5000 frequent features
vectorizer_bow = CountVectorizer(min_df=10, ngram_range=(1, 4), max_features=5000)
vectorizer_bow.fit(X_train.values)  # learn the vocabulary on the train split only
x_train_bow = vectorizer_bow.transform(X_train.values)
x_test_bow = vectorizer_bow.transform(X_test.values)

# tfidf: same vocabulary limits, but features are TF-IDF weighted
tfidfvectorizer = TfidfVectorizer(min_df=10, max_features=5000)
tfidfvectorizer.fit(X_train.values)  # fit on the train split only to avoid leakage

X_train_tfidf = tfidfvectorizer.transform(X_train.values)
X_test_tfidf = tfidfvectorizer.transform(X_test.values)
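As a quick check, the fitted vectorizers produce sparse matrices and expose the learned vocabulary (get_feature_names_out is available in recent scikit-learn versions):

print(X_train_tfidf.shape)  # (number of training documents, up to 5000 features)
print(tfidfvectorizer.get_feature_names_out()[:10])  # a sample of the learned vocabulary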

You can find my complete solution in my GitHub repository, and if you have any suggestions, please contact me via LinkedIn.
