Step-by-Step Introduction to Word Embeddings and BERT Embeddings

mitra mirshafiee
4 min read · Oct 1, 2020


Hi! In this tutorial, we’ll look at word embeddings and see how BERT uses them to get the most out of our text while preserving its meaning. You can follow along with the animated tutorial or read the instructions in this article.

Word embedding with BERT

First step: Why and how to preprocess the text?

Computers cannot take whatever sentence we give them and simply read it the way we do, so we have to take a few steps to help them understand each sentence. These steps are called preprocessing.

The first step in preprocessing and making the text ready for our model is tokenization: we take each sentence and split it into a series of separate words, or tokens.
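To make this concrete, here is a minimal sketch of tokenization in plain Python. Real pipelines usually rely on a library tokenizer (NLTK, spaCy, or BERT’s own WordPiece tokenizer), so treat the regular expression here as a toy stand-in:

```python
import re

sentence = "Mary has a crush on Charlie."

# A toy tokenizer: lower-case the text and split on anything that is
# not a letter or digit. Library tokenizers handle punctuation,
# contractions, and subwords far more carefully than this.
tokens = [t for t in re.split(r"[^A-Za-z0-9]+", sentence.lower()) if t]
print(tokens)  # ['mary', 'has', 'a', 'crush', 'on', 'charlie']
```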

Then we have to turn these words into numbers, or some other format a computer can read. We could do this by simply one-hot encoding our text (giving every word a vector in which all the values are zero except the one position assigned to that word). But this technique is fundamentally flawed. Why? Because we are only giving ones and zeros to our computer, it has no way to extract any meaning from our vectors. For more clarity and intuition, think about these two sentences:

‘Mary has a crush on Charlie’ and ‘The terrorists will crush the tower’

When we tokenize and one-hot encode these two sentences, none of the vectors assigned to Mary, Charlie, and tower has any similarity to the others. In more technical terms, they are all perpendicular to each other, and their cosine similarity (the cosine of the angle between them) is always zero.
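Here is a small sketch of that point in NumPy (the tiny vocabulary is made up for the example): every pair of distinct one-hot vectors has cosine similarity zero, no matter how related the words are.

```python
import numpy as np

# Toy vocabulary built from the two example sentences.
vocab = ["mary", "has", "a", "crush", "on", "charlie",
         "the", "terrorists", "will", "tower"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Every pair of distinct one-hot vectors is perpendicular,
# so their cosine similarity is exactly zero.
print(cosine(one_hot("mary"), one_hot("charlie")))  # 0.0
print(cosine(one_hot("mary"), one_hot("tower")))    # 0.0
```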

So, because we want to show our machine learning or deep learning model that Mary and Charlie are both humans and a tower is not, we change the way we represent each word with word embeddings.

Now compare this with how word embeddings represent each word:

Word2Vec embeddings (after reducing their 300 dimensions to 3 with PCA)
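A plot like this could be produced along the following lines, assuming gensim’s downloader API and the pre-trained Google News Word2Vec vectors (a large download) plus scikit-learn’s PCA; the word list is only an illustration:

```python
import gensim.downloader as api
from sklearn.decomposition import PCA

# Pre-trained 300-dimensional Word2Vec vectors (roughly 1.6 GB,
# so the first download takes a while).
model = api.load("word2vec-google-news-300")

# Pick any words that exist in the model's vocabulary.
words = ["Mary", "Charlie", "tower", "king", "queen", "bridge"]
vectors = [model[w] for w in words]

# Reduce the 300 dimensions to 3 so the vectors can be plotted.
reduced = PCA(n_components=3).fit_transform(vectors)
for word, vec in zip(words, reduced):
    print(word, vec)
```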

Second Step: Word Embeddings

Now that we know how important the representation of each word is, we can move on to something that actually helps our model distinguish between words. For that, we take three preliminary steps before we get to the embeddings themselves.

The first step is to add two more tokens, one at the beginning and one at the end of each sentence. These two tokens are ‘[CLS]’ and ‘[SEP]’; they essentially help us differentiate between sentences, but they also have some other functionality that we’ll cover at the end of this article.

The second step is to index our words: we assign one number, or index, to each word.

The third step is padding. Because deep learning models take inputs of the same length, we add zeros at the end of every sentence that is shorter than the longest one. All three steps are sketched in code after the figure below.

3 steps before Word Embeddings
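Here is a minimal sketch of these three steps using the Hugging Face transformers tokenizer, assuming the bert-base-uncased vocabulary: the tokenizer inserts [CLS] and [SEP], maps each token to its index, and pads the shorter sentence with zeros.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentences = ["Mary has a crush on Charlie",
             "The terrorists will crush the tower"]

# padding=True pads every sentence to the length of the longest one;
# the tokenizer also adds the [CLS] and [SEP] tokens automatically.
encoded = tokenizer(sentences, padding=True)

for ids in encoded["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(ids))
    print(ids)
# ['[CLS]', 'mary', 'has', 'a', 'crush', 'on', 'charlie', '[SEP]'] ...
```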

In the next step, we use word embeddings. We take each word and assign one specific vector to it, and each value in that vector captures one aspect of the word’s meaning. Take a look at the graph below: if the third column represented the femininity of a word, Mary would have a high value in it and Charlie a low one.

Embedding Vectors
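As a toy illustration (not BERT’s actual embedding table, and with randomly initialized values rather than meaningful dimensions), an embedding layer in PyTorch simply maps each word index to a trainable vector:

```python
import torch
import torch.nn as nn

# A toy embedding layer: 10 word indices, 4-dimensional vectors.
# In a trained model each dimension ends up capturing some aspect of
# meaning; here the values are just random initialization.
torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

word_to_index = {"mary": 0, "charlie": 1, "tower": 2}
indices = torch.tensor([word_to_index["mary"], word_to_index["charlie"]])

print(embedding(indices))  # one 4-dimensional vector per word index
```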

So far, you’ve learned about the functionality and importance of word embeddings. Now let’s take BERT’s word embeddings and see what advantages they have over other embeddings. When we give a word like ‘crush’, which can have several different meanings, to embeddings like GloVe or Word2Vec, they assign only one vector to all of its meanings. But if we give the same word to BERT, it can distinguish between the different meanings of the word, because it gives every word an embedding based on the context it is used in. That is why, if you extract BERT’s vectors for the same word in different sentences, you’ll get a different embedding each time.
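A sketch of how you could see this yourself, assuming the Hugging Face transformers library and PyTorch: we pull out the vector BERT produces for ‘crush’ in each of the two example sentences and compare them.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def crush_vector(sentence):
    """Return BERT's contextual vector for the token 'crush'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Assumes 'crush' survives as a single wordpiece, which it does
    # in the bert-base-uncased vocabulary.
    position = tokens.index("crush")
    return outputs.last_hidden_state[0, position]

v1 = crush_vector("Mary has a crush on Charlie")
v2 = crush_vector("The terrorists will crush the tower")

# The two context-dependent vectors are similar but not identical.
print(torch.cosine_similarity(v1, v2, dim=0))
```

Because BERT conditions on the whole sentence, the two vectors differ, whereas a static embedding like Word2Vec would return exactly the same vector both times.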

So what is [CLS] for in BERT?

As you remember, we said we add two more tokens to every sentence. One of them was [CLS]. This token is short for ‘classification’ and is used when classifying text. As we said, BERT gives every token an embedding based on its context, so here too, [CLS] gets a different embedding in different sentences and contexts. That means we can use this single token’s embedding to decide the label of the whole text!
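A minimal sketch of this idea, assuming transformers and PyTorch; the linear classifier here is untrained and only illustrates how the [CLS] embedding would feed a classification head:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)  # e.g. positive / negative

inputs = tokenizer("Mary has a crush on Charlie", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# Token 0 is always [CLS]; its contextual embedding summarizes the sentence.
cls_embedding = outputs.last_hidden_state[:, 0, :]
logits = classifier(cls_embedding)
print(logits)
```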

Thank you for reading this article; I hope it was useful. Feel free to add your comments and questions below, and help me write more by giving this article a clap and sharing it with other learners!

Keep learning!


Written by mitra mirshafiee

Passionate reader, data scientist and artist!
