A complete guide to cleaning social media text

Mitra Mirshafiee
5 min read · Oct 9, 2020

In this tutorial, we’ll take a look at the most important general steps of text cleaning before feeding the text to a machine learning or deep learning model. You can watch the video to get an overall understanding of the process and then continue with the code in this article.

Introduction

We spend weeks, days, and hours constructing deep learning or machine learning models. We add layers, try different packages, and so on, just to get slightly better results. But what if looking at the data itself and modifying it a little could help with both the performance and the training time of our model?
If you take a look at any text that hasn’t been modified beforehand and comes from real-world users, you’ll notice that people generally use language however they please to express their feelings, emotions, and news. As a result, their words and sentences are just series of characters that can’t be properly distinguished and interpreted by our natural language processing algorithms and models.

So we need to define and perform a sequence of operations to preprocess the corpus at hand and actually clean up the text. Though natural language processing covers a large range of tasks, the most important preprocessing steps to keep in mind are largely the same.

But as Tomas Mikolov, one of the authors of Word2vec, says, building a deep learning model that can learn different words and the semantic relationships between them lets you get away with as little cleaning as possible. Still, that little part plays a big role: it reduces memory usage by shrinking the vocabulary size and helps the model recognize more words by deleting unnecessary characters around them.

For writing this I used the data from the Real or Not Kaggle NLP competition, and you can see the full implementation of the code in this OneDrive notebook.

The seven steps are:

Substituting emojis and emoticons

When cleaning, you might prefer to remove all the punctuation first, and with it all the emoticons made from punctuation marks, like :), :( and :|. But by doing this you’re actually removing part of the meaning. A better way of handling punctuation is to first substitute these parts and then delete what remains.
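Here is a minimal sketch of that idea, assuming the third-party emoji package is installed; the function name and the small emoticon dictionary below are just illustrations you would extend for your own corpus:

    import emoji  # third-party package: pip install emoji

    # A small, hand-made emoticon dictionary (extend it for your own corpus).
    EMOTICONS = {
        ":)": "smiley_face",
        ":(": "sad_face",
        ":|": "neutral_face",
    }

    def substitute_emojis_and_emoticons(text):
        # Replace Unicode emojis with their textual names, e.g. the fire emoji becomes :fire:
        text = emoji.demojize(text)
        # Replace common text emoticons with descriptive tokens before punctuation is stripped.
        for emoticon, meaning in EMOTICONS.items():
            text = text.replace(emoticon, " " + meaning + " ")
        return text

    print(substitute_emojis_and_emoticons("what a save :)"))
    # -> "what a save  smiley_face "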

Removing URLs

Depending on your text and where it was generated, you may run into different formats of links in your dataset. For example, tweets from Twitter contain links that start with http or https, followed by the t.co domain and a short series of letters and numbers.
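A simple way to strip these out is a regular expression; the pattern below is one common choice that also catches bare www links, not the only possible one:

    import re

    def remove_urls(text):
        # Matches http://..., https://... and bare www. links, including Twitter's t.co shortener.
        return re.sub(r"https?://\S+|www\.\S+", "", text)

    print(remove_urls("check this out https://t.co/abc123"))
    # -> "check this out "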

Tokenization

Because our models can’t understand sentences as a whole, we split our text into separate chunks called tokens. We could approach this task by simply using .split(), but as you can see in the code below, this results in tokens like “downtown.” or “he’s”, which are not what we want. To avoid this, we use the tokenizers from NLP packages, which know how to separate words from the punctuation around them.

There are many tokenizers with different functionalities, but here we’ll use word_tokenize from NLTK as a general example.
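For example, comparing plain .split() with word_tokenize (the punkt models only need to be downloaded once):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")  # tokenizer models, needed once

    text = "He's walking downtown."
    print(text.split())         # ["He's", 'walking', 'downtown.']
    print(word_tokenize(text))  # ['He', "'s", 'walking', 'downtown', '.']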

Removing Stopwords

We all know how frequently words like ‘is’, ‘are’, ‘am’, ‘he’, and ‘she’ are used. These words are called stop words, and they’re so common that they appear in all sorts of sentences. They don’t add any specific information that could change the meaning of a sentence, so we simply ignore them when performing tasks like text classification. Google ignores them both when indexing entries for search and when retrieving them as the results of a search query.

Different libraries like NLTK and spaCy ship different sets and numbers of stop words, so depending on how freely you want to remove parts of the text, you can choose one. (NLTK has roughly 180 English stop words, while spaCy has over 300.)
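A minimal sketch with NLTK’s stop word list:

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # needed once
    stop_words = set(stopwords.words("english"))

    tokens = ['he', 'is', 'walking', 'downtown']
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['walking', 'downtown']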

Removing Punctuation

Punctuation can have a big impact on the emotion a text is trying to convey, but because it is used in all types of sentences, it, like stop words, doesn’t impart much further knowledge about the text at hand. Removing it also helps reduce our vocabulary size, which can speed up our algorithm and training time.

By using str.maketrans we build a translation table whose third argument lists the characters that should be removed from the text, so during the translation process all the punctuation is deleted from the tokens.
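A sketch of that step, using Python’s built-in string.punctuation:

    import string

    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans("", "", string.punctuation)

    tokens = ['He', "'s", 'walking', 'downtown', '.']
    tokens = [t.translate(table) for t in tokens]
    tokens = [t for t in tokens if t]  # drop tokens that became empty
    print(tokens)  # ['He', 's', 'walking', 'downtown']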

Lowercasing

Usually, lowercasing hugely reduces the size of the vocabulary. It replaces all capital letters with their lowercase form, so “Another” and “There” become “another” and “there”. But pay close attention: at the same time it robs some words like “Bush”, “Bill”, and “Rocket” of their accurate representation and meaning by turning them into “bush”, “bill”, and “rocket”. You can simply lowercase your words with .lower()
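For example, applied to a list of tokens (the tokens here are just an illustration):

    tokens = ['Another', 'Bush', 'fire', 'There']
    print([t.lower() for t in tokens])  # ['another', 'bush', 'fire', 'there']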

Lemmatization or stemming?

The purpose of lemmatization and stemming is the same: both relate different forms of words (verbs, nouns, and words in general) to their base form, but they do it in different ways. Stemming chops off the ends of words in the hope of getting a simple, correct base form. Lemmatization does this properly, with the help of a dictionary. So if we give “studies” to a stemmer, it returns “studi”, but if we give it to a lemmatizer, it outputs “study”. Both of these operations tend to reduce your vocabulary size and therefore the variety in your text, so be mindful of the tradeoff between the performance and the training time of your model.

Lemmatization with NLTK:
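A minimal sketch with NLTK’s WordNetLemmatizer (the exact code in the notebook may differ):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # needed once
    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("studies"))           # 'study' (default part of speech is noun)
    print(lemmatizer.lemmatize("walking", pos="v"))  # 'walk'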

Stemming with NLTK:
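A matching sketch with NLTK’s PorterStemmer:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    print(stemmer.stem("studies"))  # 'studi'
    print(stemmer.stem("walking"))  # 'walk'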

Optional but better lemmatizer

If we want to lemmatize all the words, not just one type such as verbs, we first have to tell the NLTK lemmatizer the part of speech of each word. So we use a tagger to detect each word’s type and then let the lemmatizer do its job.
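One possible way to wire this together with NLTK’s pos_tag; the tag-mapping helper below is my own convention for converting Penn Treebank tags into the WordNet tags the lemmatizer expects:

    import nltk
    from nltk import pos_tag, word_tokenize
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()

    def to_wordnet_pos(treebank_tag):
        # Map Penn Treebank tags (from pos_tag) to WordNet tags; default to noun.
        mapping = {"J": wordnet.ADJ, "V": wordnet.VERB, "N": wordnet.NOUN, "R": wordnet.ADV}
        return mapping.get(treebank_tag[0], wordnet.NOUN)

    def lemmatize_sentence(text):
        tokens = word_tokenize(text)
        return [lemmatizer.lemmatize(token, to_wordnet_pos(tag))
                for token, tag in pos_tag(tokens)]

    print(lemmatize_sentence("he was running and eating"))
    # -> ['he', 'be', 'run', 'and', 'eat']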

And as a last step, if you’re using a deep learning model with a dictionary, go through your dataset, check which words were not recognized, and then try to find ways to normalize those words. You may even consider manually correcting some of them, like “Goooaaaal” or “Snaaap”.
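One possible automated helper before any manual correction, not mentioned above but common for elongated social media words, is to collapse characters repeated three or more times; the function name and threshold here are just an illustration, and the result may still not be a dictionary word, but it brings variants closer together:

    import re

    def shorten_elongated_words(word):
        # Collapse any character repeated three or more times down to two,
        # e.g. 'Goooaaaal' -> 'Gooaal', 'Snaaap' -> 'Snaap'.
        return re.sub(r"(.)\1{2,}", r"\1\1", word)

    print(shorten_elongated_words("Goooaaaal"))  # 'Gooaal'
    print(shorten_elongated_words("Snaaap"))     # 'Snaap'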

Conclusion

In this era, we see computers and machines rising in every aspect of our lives, doing our chores, connecting us, and so on. In return, we have to help them understand our language better and make the interaction easier for both humans and machines. Cleaning is just one of the steps that helps us build faster and more accurate models, and because it modifies the original text, we have to be careful and construct a procedure that removes as little meaning as possible.

Thanks for reading this article! I hope it was useful to you, and if it was, please make sure to give it a clap! 😉
