TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Cleaning & Preprocessing Text Data for Sentiment Analysis

Muriel Kosaka · Published in TDS Archive · 4 min read · Nov 23, 2020


Photo by Jan Antonin Kolar on Unsplash

Sentiment analysis for text data combines natural language processing (NLP) and machine learning techniques to assign weighted sentiment scores to the entities, topics, or categories within a sentence or document. In a business setting, sentiment analysis is extremely helpful, as it can help you understand customer experiences, gauge public opinion, and monitor brand and product reputation.

For this example, we are examining a dataset of Amazon Alexa reviews which can be found here on Kaggle.

First, let’s import the necessary libraries:

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
import spacy

nltk.download('stopwords')  # the NLTK stopword list must be downloaded once

Next, let’s read in our .tsv file and see the first few rows:

df = pd.read_csv('amazon_alexa.tsv', sep='\t')
df.head()

After examining the data further, we see that ratings range from 1 to 5 and that feedback is categorized as either 0 or 1 for each review, but for right now we’ll just focus on the verified_reviews column.
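If you want to verify those value ranges yourself, a quick check of the two label columns looks like this:

df['rating'].value_counts().sort_index()  # counts for each rating, 1 through 5
df['feedback'].value_counts()             # counts for the 0/1 feedback labels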

I initialize the spaCy English model, keeping only the components needed for lemmatization:

# the 'en' shortcut was removed in spaCy 3.x; load the small English model instead
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

The first preprocessing step we’ll do is transform all reviews in verified_reviews to lowercase and store the result in a new column, new_reviews.

df['new_reviews'] = df['verified_reviews'].apply(lambda x: " ".join(word.lower() for word in x.split()))
df['new_reviews'].head()

Next, we will remove punctuation by stripping every character that is not a word character (\w) or whitespace (\s):

df['new_reviews'] = df['new_reviews'].str.replace(r'[^\w\s]', '', regex=True)
df['new_reviews'].head()

Upon further inspection of the reviews, I noticed emojis were used, so I will remove those using a function provided by Kamil Slowikowski and apply it to new_reviews.

# REFERENCE: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

df['new_reviews'] = df['new_reviews'].apply(remove_emoji)
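As a quick sanity check that the pattern works (the example review and emoji here are made up for illustration):

print(remove_emoji('i love my echo😍'))  # -> 'i love my echo'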

After these transformations, the reviews are much cleaner, but we still have some words that we should remove, namely the stopwords. Stopwords are commonly used words (e.g., “the”, “a”, “an”) that add little meaning to a sentence and can be ignored without drastically changing its meaning.

stop = stopwords.words('english')
df['new_reviews'] = df['new_reviews'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
df.head(20)

Lastly, we will implement lemmatization using spaCy so that we can count the appearances of each word. Lemmatization strips grammatical tense and inflection, reducing each word to its base form. Another way of converting words to a base form is stemming. While stemming truncates a word to a crude root by chopping off suffixes, lemmatization maps the word to its dictionary form, or lemma. For example, if we performed stemming on the word “apples”, the result would be “appl”, whereas lemmatization would give us “apple”. I therefore prefer lemmatization over stemming, as its output is much easier to interpret.
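To see the difference concretely, here is a minimal sketch contrasting the two approaches; it uses NLTK’s PorterStemmer for the stemming side (not otherwise used in this article) and the spaCy pipeline we loaded earlier:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('apples'))                     # 'appl' -- a crude root, not a real word
print([token.lemma_ for token in nlp('apples')])  # ['apple'] -- the dictionary form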

def space(comment):
    doc = nlp(comment)
    return " ".join(token.lemma_ for token in doc)

df['new_reviews'] = df['new_reviews'].apply(space)
df.head(20)

To review, the steps used to preprocess our data (combined into one function in the sketch after this list) were:

  1. Make text lowercase
  2. Remove punctuation
  3. Remove emojis
  4. Remove stopwords
  5. Lemmatization
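Here is a minimal sketch that consolidates those five steps into a single function; it assumes the nlp model, stop list, and remove_emoji function defined earlier in this post:

def preprocess(text):
    text = " ".join(word.lower() for word in text.split())     # 1. lowercase
    text = re.sub(r'[^\w\s]', '', text)                        # 2. remove punctuation
    text = remove_emoji(text)                                  # 3. remove emojis
    text = " ".join(w for w in text.split() if w not in stop)  # 4. remove stopwords
    return " ".join(token.lemma_ for token in nlp(text))       # 5. lemmatize

df['new_reviews'] = df['verified_reviews'].apply(preprocess)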

Now our text is ready for analysis! There are many ways to preprocess unstructured text data so that computers can make sense of it. As a next step, I will explore sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner).
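As a small preview of that next step, VADER ships with NLTK; a minimal sketch, assuming the vader_lexicon resource has been downloaded:

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sia.polarity_scores('I love my Echo!')  # returns neg/neu/pos/compound scores

One caveat worth noting: VADER uses capitalization and punctuation as intensity cues, so it is often applied to the raw reviews rather than the fully cleaned text.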

Thank you for reading! :)

