TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Cleaning & Preprocessing Text Data for Sentiment Analysis

Muriel Kosaka · Published in TDS Archive · 4 min read · Nov 23, 2020


Photo by Jan Antonin Kolar on Unsplash

Sentiment analysis for text data combines natural language processing (NLP) and machine learning techniques to assign weighted sentiment scores to the entities, topics, or categories within a sentence or document. In a business setting, sentiment analysis is extremely helpful, as it can help you understand customer experiences, gauge public opinion, and monitor brand and product reputation.

For this example, we are examining a dataset of Amazon Alexa reviews which can be found here on Kaggle.

First, let’s import the necessary libraries:

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
import spacy

nltk.download('stopwords')  # the NLTK stopword list must be downloaded once

Next, let’s read in our .tsv file and see the first few rows:

df = pd.read_csv('amazon_alexa.tsv', sep='\t')
df.head()

After examining the data further, we see that ratings range from 1 to 5 and that feedback is categorized as either 0 or 1 for each review, but for right now we’ll just focus on the verified_reviews column.
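If you want to verify those value ranges yourself, a quick check of the two label columns looks like this:

df['rating'].value_counts().sort_index()  # counts for each rating, 1 through 5
df['feedback'].value_counts()             # counts for the 0/1 feedback labels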

I initialize the spaCy English model, keeping only the components needed for lemmatization:

# the 'en' shortcut was removed in spaCy 3.x; load the small English model instead
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

The first preprocessing step we’ll do is transform all reviews in verified_reviews to lowercase and store the result in a new column, new_reviews.

df['new_reviews'] = df['verified_reviews'].apply(lambda x: " ".join(word.lower() for word in x.split()))
df['new_reviews'].head()

Next, we will remove punctuation by stripping every character that is not a word character (\w) or whitespace (\s):

df['new_reviews'] = df['new_reviews'].str.replace(r'[^\w\s]', '', regex=True)
df['new_reviews'].head()

Upon further inspection of the reviews, I noticed emojis were used, so I will remove those using a function provided by Kamil Slowikowski and apply it to new_reviews.

# REFERENCE: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

df['new_reviews'] = df['new_reviews'].apply(remove_emoji)
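As a quick sanity check that the pattern works (the example review and emoji here are made up for illustration):

print(remove_emoji('i love my echo😍'))  # -> 'i love my echo'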

After these transformations, the reviews are much cleaner, but we still have some words that we should remove, namely the stopwords. Stopwords are commonly used words (e.g., “the”, “a”, “an”) that add little meaning to a sentence and can be ignored without drastically changing its meaning.

stop = stopwords.words('english')
df['new_reviews'] = df['new_reviews'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
df.head(20)

Lastly, we will implement lemmatization using spaCy so that we can count the appearances of each word. Lemmatization strips grammatical tense and inflection, reducing each word to its base form. Another way of converting words to a base form is stemming. While stemming truncates a word to a crude root by chopping off suffixes, lemmatization maps the word to its dictionary form, or lemma. For example, if we performed stemming on the word “apples”, the result would be “appl”, whereas lemmatization would give us “apple”. I therefore prefer lemmatization over stemming, as its output is much easier to interpret.
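To see the difference concretely, here is a minimal sketch contrasting the two approaches; it uses NLTK’s PorterStemmer for the stemming side (not otherwise used in this article) and the spaCy pipeline we loaded earlier:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('apples'))                     # 'appl' -- a crude root, not a real word
print([token.lemma_ for token in nlp('apples')])  # ['apple'] -- the dictionary form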

def space(comment):
    doc = nlp(comment)
    return " ".join(token.lemma_ for token in doc)

df['new_reviews'] = df['new_reviews'].apply(space)
df.head(20)

To review, the steps used to preprocess our data (combined into one function in the sketch after this list) were:

  1. Make text lowercase
  2. Remove punctuation
  3. Remove emojis
  4. Remove stopwords
  5. Lemmatization
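Here is a minimal sketch that consolidates those five steps into a single function; it assumes the nlp model, stop list, and remove_emoji function defined earlier in this post:

def preprocess(text):
    text = " ".join(word.lower() for word in text.split())     # 1. lowercase
    text = re.sub(r'[^\w\s]', '', text)                        # 2. remove punctuation
    text = remove_emoji(text)                                  # 3. remove emojis
    text = " ".join(w for w in text.split() if w not in stop)  # 4. remove stopwords
    return " ".join(token.lemma_ for token in nlp(text))       # 5. lemmatize

df['new_reviews'] = df['verified_reviews'].apply(preprocess)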

Now our text is ready for analysis! There are many ways to preprocess unstructured text data so that computers can make sense of it. As a next step, I will explore sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner).
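As a small preview of that next step, VADER ships with NLTK; a minimal sketch, assuming the vader_lexicon resource has been downloaded:

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sia.polarity_scores('I love my Echo!')  # returns neg/neu/pos/compound scores

One caveat worth noting: VADER uses capitalization and punctuation as intensity cues, so it is often applied to the raw reviews rather than the fully cleaned text.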

Thank you for reading! :)

