Hi. In this video, we will apply neural networks to text. First, let's remember what text is. You can think of it as a sequence of characters, words, or anything else, and in this video we will continue to think of text as a sequence of words, or tokens. Let's also recall how bag of words works. For every distinct word in your dataset you have a feature column, so you are effectively vectorizing each word with a one-hot-encoded vector: a huge vector of zeros with a single non-zero value, in the column corresponding to that particular word. In this example, we have "very", "good", and "movie", and all of them are vectorized independently. In this setting, for real-world problems, you end up with hundreds of thousands of columns. And how do we get to the bag-of-words representation? We can simply sum up all those one-hot vectors, and we arrive at a bag-of-words vectorization that now corresponds to "very good movie". So it is useful to think of the bag-of-words representation as a sum of sparse one-hot-encoded vectors, one per word.

Okay, let's move to the neural network way. In contrast to the sparse representation we've seen in bag of words, in neural networks we usually prefer dense representations. That means we can replace each word with a dense vector that is much shorter; it can have, say, 300 real-valued components. An example of such vectors is word2vec embeddings, which are pretrained embeddings learned in an unsupervised manner. We will dive into the details of word2vec in the next two weeks. All we need to know right now is that word2vec vectors have a nice property: words that appear in similar contexts, in terms of neighboring words, tend to have vectors that are nearly collinear, that is, they point in roughly the same direction. That is a very useful property that we will exploit further.

So now we can replace each word with a dense vector of 300 real values. What do we do next? How can we come up with a feature descriptor for the whole text? We can proceed in the same manner as we did for bag of words: we just take the sum of those vectors, and we get a representation of the whole text, like "very good movie", based on word2vec embeddings. And that sum of word2vec vectors actually works in practice. It can give you a good baseline descriptor, baseline features for your classifier, and it can work pretty well.

Another approach is to run a neural network over these embeddings. Let's look at two examples. We have the sentence "cat sitting there", or "dog resting here", and for each word we take a row that represents its word2vec embedding of length, let's say, 300. Now we want to apply a neural network here somehow. Let's first think about the following: how do we make use of 2-grams with this representation? In the bag-of-words representation, each particular 2-gram had its own column, and you had a very long sparse vector over all possible 2-grams. But here we don't have word2vec embeddings for token pairs; we only have word2vec embeddings for individual words. So how can we analyze 2-grams here? It turns out that we can look at pairs of those embedding vectors, and you can think of it as a sliding window.
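To make the two representations above concrete, here is a minimal Python sketch contrasting a bag-of-words vector built as a sum of sparse one-hot vectors with a dense text descriptor built as a sum of word-embedding vectors. The toy vocabulary and the random vectors standing in for real word2vec embeddings are assumptions for illustration only.

```python
import numpy as np

# Toy vocabulary; real problems have hundreds of thousands of columns.
vocab = ["very", "good", "movie", "bad", "plot"]
embedding_dim = 300

# Stand-in for pretrained word2vec embeddings (random vectors, illustration only).
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=embedding_dim) for w in vocab}

def one_hot(word):
    """Sparse vector: all zeros except a single 1 in the word's column."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

tokens = ["very", "good", "movie"]

# Bag of words = sum of one-hot vectors, one column per distinct word.
bow_vector = sum(one_hot(w) for w in tokens)          # shape (5,)

# Dense descriptor = sum of word2vec-style embeddings of the same tokens.
dense_vector = sum(embeddings[w] for w in tokens)     # shape (300,)

print(bow_vector)          # [1. 1. 1. 0. 0.]
print(dense_vector.shape)  # (300,)
```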
So, here in the green border, we have the first two words, and we take their word embeddings, and we want to take all those values and analyze them somehow with a neural network. For that purpose, we can use a convolutional filter of the same size, filled with some numbers. If the filter's values are pretty close to the values that correspond to "cat sitting", then when you convolve that filter with the 2-gram "cat sitting", you will get a high activation, simply because the convolutional filter is very similar to the word embeddings of this pair of words. Okay, so now we know how to analyze 2-grams in our text: we just convolve the word vectors that sit next to each other. But why is this better than bag of words? In the bag-of-words manner, each particular 2-gram had its own column, whereas here we have to come up with a lot of convolutional filters that will learn a representation of 2-grams so that we can analyze them as well. Why is that better, then? It turns out that we can exploit a nice property of word2vec embeddings: words that are similar, in terms of the contexts they are seen in, are similar in terms of cosine distance. Cosine distance is closely related to the dot product, and a dot product is exactly the convolution we are computing. That means that if you take a different sentence, like "dog resting here", you will find that "cat" and "dog" have similar representations in word2vec, just because they are seen in the same contexts, like "my dog ran away" or "my dog ate my homework": you can replace "dog" with "cat" and those would still be frequent sentences. So why is a convolutional filter better? Because you can take the 2-gram "dog resting", and thanks to the fact that its values are pretty similar to the values of the 2-gram "cat sitting", when you convolve it with the same convolutional filter you will get a high activation value as well. So it turns out that if we have good word embeddings, then using convolutions we can look at a higher-level meaning of the 2-gram. It's not just "cat sitting", or "dog resting", or "cat resting", or "dog sitting"; it is actually "animal sitting", and that is the meaning of the 2-gram that we can learn with our convolutional filter. So this is pretty cool: we no longer need columns for all possible 2-grams, we just need to look at pairs of word embeddings and learn convolutional filters that will extract some meaningful features.

Okay. You can see that this easily extends to 3-grams, 4-grams, and any other n-gram, and contrary to the bag-of-words representation, your feature matrix won't explode, because its size is actually fixed. All you change is the size of the filter with which you do the convolution, and that is a pretty cheap operation. You can also see that, just like in convolutional neural networks for images, one filter is not enough. You need to track many n-grams, many different meanings of those 2- and 3-grams, and that's why you need a lot of convolutional filters. These filters are called 1D convolutions because we slide the window in only one direction, contrary to, say, an image, where we slide the window in two directions. Let's see how that sliding window actually works. We have an input sequence, "cat sitting there or here"; for each word we have its word2vec representation, and we have a sliding window of size three.
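Here is a minimal numpy sketch of the 2-gram idea above: a window of two word vectors is slid along the sentence and dotted with a filter. The random vectors, the trick of making "dog" and "resting" small perturbations of "cat" and "sitting", and the choice of the filter itself are assumptions used only to illustrate why similar 2-grams produce similar activations under the same filter.

```python
import numpy as np

dim = 300
rng = np.random.default_rng(0)

# Stand-in embeddings: pretend "dog" is close to "cat" and "resting" to "sitting",
# as word2vec would give for words seen in similar contexts (illustration only).
cat, sitting, there, here = rng.normal(size=(4, dim))
dog = cat + 0.05 * rng.normal(size=dim)
resting = sitting + 0.05 * rng.normal(size=dim)

def conv_2grams(word_vectors, kernel):
    """Slide a window of two word vectors along the sentence (a 1D convolution)
    and return the dot product of each 2-gram with the filter."""
    activations = []
    for i in range(len(word_vectors) - 1):
        window = np.concatenate(word_vectors[i:i + 2])  # 2 * dim values
        activations.append(window @ kernel)
    return np.array(activations)

# A filter tuned to the "animal sitting" pattern (here: simply the 2-gram itself).
kernel = np.concatenate([cat, sitting])
kernel /= np.linalg.norm(kernel)

print(conv_2grams([cat, sitting, there], kernel))   # high activation at position 0
print(conv_2grams([dog, resting, here], kernel))    # similarly high, same filter
```

Both sentences light up the same filter at the position of the animal/verb pair, which is exactly the "animal sitting" effect described above.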
And let's add some padding so that the size of the output is the same as the size of the input. Let's convolve the filter with the first patch that we get from this matrix, and let's say we get 0.1, then 0.3, -0.2, 0.7, and -0.4. What you can see here is that we slide the window in only one direction, and that direction is actually time: you can think of the sequence of words as happening in time, so the words occur along the time axis. Okay, so what do we do with these numbers now? The bad property is that the number of outputs equals the number of inputs, which means that if you have variable-length sentences, then you have a variable number of features, and we don't want that, because we don't know what to do with it. So let's assume that, just like in the bag-of-words manner, we can afford to lose the ordering of the words. That means we don't really care where we've seen the 2-gram with the meaning "animal sitting" that we are trying to find with this convolutional filter; we don't care whether it occurred at the beginning of the sentence or at the end. The only thing we care about is whether that combination appeared in the text at all. And if you assume that, then all you do is take the maximum activation that this convolutional filter produced over the whole text, and you take that value as the result of your convolution. That is called max pooling over time: just like images have max pooling, here we apply it over time. So, what we've done: we take an input sequence, we take a convolutional filter of size three times the embedding dimension, we convolve it, sliding in one direction, and then we take the maximum activation, and that is our output.

Okay, let's come to the final model. The final architecture might look like this. We can use filters of size three, four, and five, so that we capture information about 3-, 4-, and 5-grams, and for each n-gram size we learn 100 filters. That means that effectively we have 300 outputs. Let's look at the image. We have an input sequence, and let's say that for the red window, which corresponds to some convolutional filter, the maximum activation is 0.7, and we put it in the output. For another filter size, shown in green, let's say for 2-grams, the maximum value we've seen when convolving it over the whole sentence is -0.7, and we add it to the output. This way, using different filters of different sizes, we get 300 outputs. Okay, so what do we do with that vector? That vector is actually a kind of embedding of our input sequence, and we have proposed a way to convert our input sequence into a vector of fixed size. What we do next is the obvious thing: we apply some more dense layers, that is, a multi-layer perceptron, on top of those 300 features, and train it for any task we want. It can be classification, regression, or anything else. Okay, so let's compare the quality of this model with the classical bag-of-words approach. There is a link to the paper where these experiments were done: on a customer reviews dataset, they compared this model with Naive Bayes on top of 1- and 2-grams, and that classical model gave 86.3% accuracy.
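Below is a minimal Keras sketch of this final architecture: filters of width 3, 4, and 5, 100 of each, max pooling over time, concatenation into 300 features, and a small MLP on top. The vocabulary size, maximum sequence length, trainable embedding layer (rather than frozen word2vec weights), and binary sentiment output are assumptions, not the exact setup from the paper.

```python
import tensorflow as tf

vocab_size, max_len, embedding_dim = 20000, 100, 300  # assumed hyperparameters

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")        # token ids
x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs)

pooled = []
for kernel_size in (3, 4, 5):          # filters for 3-, 4-, and 5-grams
    conv = tf.keras.layers.Conv1D(filters=100, kernel_size=kernel_size,
                                  activation="relu", padding="same")(x)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))   # max pooling over time

features = tf.keras.layers.Concatenate()(pooled)                  # 3 * 100 = 300 features
hidden = tf.keras.layers.Dense(100, activation="relu")(features)  # MLP on top
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)  # e.g. review sentiment

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```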
And if you use this proposed architecture of 1D convolutions with an MLP on top of those features, you get a whopping 3.8-point bump, to almost 90 percent accuracy. That is pretty cool, because we just applied neural networks: we proposed how to embed our words, we can use a lot of unlabeled text to learn those embeddings in an unsupervised manner, and we proposed how to analyze 2-grams or 3-grams using convolutions, and those are all pretty fast operations. So it works even faster than bag-of-words, and it works better, which is pretty cool.

Okay, let's summarize. You can simply average pre-trained word2vec embeddings for your text: you split your text into tokens, take an embedding vector for each token, and just sum (or average) them up. That is a baseline model, and it can actually work pretty well; a small code sketch of this baseline follows below. Another approach, which is a little bit better, is to use the 1D convolutions that we have described. This way you train the neural network end to end: you have an input sequence and the result you want to predict, and you use backpropagation to train all those convolutions, so that they learn the specific features this neural network needs to classify your sentence. In the next video, we will continue to apply convolutions to text.
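To make the baseline from the summary concrete, here is a minimal sketch that averages pretrained word2vec vectors per text and fits a simple classifier on top. The gensim model name, the two-example toy dataset, and the choice of logistic regression are illustrative assumptions, not the exact setup from the lecture or the paper.

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Pretrained 300-dimensional word2vec vectors (large download; name is an assumption).
w2v = api.load("word2vec-google-news-300")

def text_vector(tokens, dim=300):
    """Average the word2vec vectors of the tokens that are in the vocabulary."""
    vectors = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Tiny toy dataset: 1 = positive review, 0 = negative review.
texts = [["very", "good", "movie"], ["rather", "boring", "film"]]
labels = [1, 0]

X = np.stack([text_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```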