In the last section, we introduced matrix factorization as a way to construct embeddings directly from statistics. The problem with this approach, however, is its computational cost. The time complexity of matrix factorization is approximately quadratic with respect to the smaller of the two sets, the set of terms or the set of documents. There are hundreds of thousands of words in English, and the more documents we have, the better our representations will be, so quadratic is not good.

More recently, researchers have begun to approach the process of creating embeddings not by deciding how the objects being modeled should be compared, which is what the psychologists did, nor through matrix factorization techniques that minimize reconstruction error, like latent semantic analysis, but instead by using methods similar to what we've used in this course. What they did was train a model on a task that required an understanding of the domain, and then treat the first layer of the model as the embedding, in effect using transfer learning.

One recent and influential example of this approach is called Word2Vec. It belongs to a family of shallow, window-based approaches that borrow the idea of a context window to define co-occurrence, but don't actually construct a full matrix of co-occurrence statistics. Instead, these approaches use the contents of the sliding window to transform the sequence of words in the corpus into features and labels for their machine learning task, much like what we did in the first module. However, unlike module one, where we used the final event in the window as the label, these researchers use the word at the center of the window as the feature, and its surrounding context as the label. The words that surround the central word are called the positive words for a particular example, and the remaining words in the corpus are the negative words. The model's task is to maximize the likelihood of the positive words and minimize the likelihood of the negative words.

The architecture of the neural network in Word2Vec was actually very simple: it contained an input layer, a dense hidden layer, and an output layer. The input layer had one node for every word in the vocabulary, plus one additional node for out-of-vocabulary words. The hidden layer contained a non-linear activation function, and the researchers trained different versions of the model with different numbers of hidden-layer nodes. The output layer had a node for every word in the vocabulary.

However, the researchers found that it was not practical to use our normal, full cross-entropy for this architecture. Why was normal cross-entropy impractical in this case? The answer has to do with the number of classes. If you think about the softmax equation, its denominator requires summing over the entire set of output nodes, and this calculation gets expensive when you have a large number of classes, as you do when your set of labels is the size of the vocabulary. Instead of computing normal cross-entropy, the authors of Word2Vec used negative sampling to make cross-entropy less expensive without negatively impacting performance. Negative sampling works by shrinking the denominator of the softmax: instead of summing over all classes, which in this case is our vocabulary, it sums over a smaller subset. To understand how they arrived at that subset, think back to the original Word2Vec task: it trains the model to maximize the likelihood of positive words and minimize the likelihood of negative words.
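To make the feature-and-label construction and the shrinking softmax denominator concrete, here is a minimal sketch in plain Python and NumPy. The toy vocabulary, the random stand-in embedding matrices, and the number of negative samples are illustrative assumptions rather than the authors' actual implementation; the only point is to show which terms each denominator sums over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example of the skip-gram setup described above: the center word is the
# feature, the surrounding words within the window are the positive labels.
sentence = "a little red fluffy dog wanders down the road".split()
center = "dog"
window = 4
i = sentence.index(center)
positives = [w for w in sentence[max(0, i - window): i + window + 1] if w != center]

# A stand-in vocabulary. In a real corpus this would contain hundreds of
# thousands of words, and every word outside the context window is a negative.
vocab = sorted(set(sentence) | {"cat", "piano", "galaxy", "cheese", "river"})
negatives = [w for w in vocab if w not in positives and w != center]

# Toy embedding matrices, normally learned by the hidden and output layers.
dim = 8
word_to_id = {w: j for j, w in enumerate(vocab)}
W_in = rng.normal(size=(len(vocab), dim))    # input-side vectors (the embedding)
W_out = rng.normal(size=(len(vocab), dim))   # output-side vectors

v_center = W_in[word_to_id[center]]
scores = {w: float(v_center @ W_out[word_to_id[w]]) for w in vocab}

# Full softmax: the denominator sums over the ENTIRE vocabulary, which is what
# makes normal cross-entropy so expensive when the vocabulary is large.
full_denom = sum(np.exp(s) for s in scores.values())

# Negative sampling: keep the positive words, but only a small random sample of
# the negatives, so the denominator runs over a much smaller subset.
k = 3
subset = positives + list(rng.choice(negatives, size=k, replace=False))
sampled_denom = sum(np.exp(scores[w]) for w in subset)

print(f"terms in full denominator:    {len(vocab)}")
print(f"terms in sampled denominator: {len(subset)}")
```

The sampled denominator has only a handful of terms, while the full one has one term per vocabulary word, and that difference is exactly the savings the next paragraph describes.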
What the authors of Word2Vec realized is that, because of the size of the context window relative to the size of the vocabulary, the vast majority of words will be negative for a given training example. Their idea was to compute the softmax using all of the positive words and a random sample of the negative ones, and that's where this technique gets its name. Using a subset of words in the softmax cut down the number of weight updates needed, but the resulting network still performed well. For example, if the original sentence was "A little red fluffy dog wanders down the road," and assuming our input word is dog and our context window extends four words to either side, we would have eight positive words and a few hundred thousand negative ones. Negative sampling would use only a portion of those negatives.

Even with negative sampling, constructing word representations can be a computationally expensive task. One way of reducing the cost further is to use fewer examples of common words. Consider our previous example sentence. In addition to positive words like little and road, which provide some semantic information, there are also words that don't really help us understand the idea of dog, like a and the. Using fewer of these sorts of positive words cut down the size of the dataset and further improved accuracy.

Part of the reason Word2Vec has become so widely known is that the embeddings it produced exhibited semantic compositionality. That is, you could take two embeddings, combine them using mathematical operations, and get results that were semantically plausible. For example, the addition operation seems to work just like an AND. In this table, I've put the embeddings that were added and the four closest resulting embeddings. For example, when you add Vietnam and capital, the first result is Hanoi, which is indeed the capital of Vietnam.

However performant window-based methods like Word2Vec may be, they are still optimizing using a noisy signal: the individual sequences of words in a corpus. Recently, some researchers set out to produce embeddings with the same powerful semantic properties as Word2Vec, but which, like the matrix factorization methods, are optimized using the entire set of co-occurrence statistics rather than just individual co-occurrence events. Wouldn't it be great, they thought, to use all of that information instead of just noisy slices of it?

Their approach is called GloVe, and it's sort of a hybrid between the matrix factorization methods, like latent semantic analysis, and the window-based methods, like Word2Vec. Like latent semantic analysis, it begins with a full co-occurrence matrix. However, unlike latent semantic analysis, it doesn't use matrix factorization to produce embeddings; instead, like Word2Vec, it uses a machine learning model that is optimized using gradient descent and a loss function. But unlike Word2Vec, instead of predicting the context around a word, GloVe uses a novel loss function. The loss function is derived from a simple observation: the co-occurrence ratios of two words seem to be semantically important. Here, you can see that the likelihood of encountering solid in the context of ice is 8.9 times higher than encountering solid in the context of steam, which makes sense, because ice is a solid. Similarly, the likelihood of encountering gas in the context of ice is far less than the likelihood of encountering gas in the context of steam.
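To see where ratios like these come from, here is a hedged sketch of how the probabilities are read off a co-occurrence matrix. The counts below are made up purely for illustration and are not the statistics from the GloVe paper; only the pattern of the ratios matters.

```python
import numpy as np

# Made-up co-occurrence counts for illustration only (rows: target words,
# columns: context words). Real counts would come from sliding a context
# window over a large corpus.
targets = ["ice", "steam"]
contexts = ["solid", "gas", "water", "fashion"]
X = np.array([
    [190,   7, 900, 2],   # counts of each context word near "ice"
    [ 22, 160, 880, 2],   # counts of each context word near "steam"
], dtype=float)

# P(k | i) = X_ik / X_i: the probability of seeing context word k near word i.
P = X / X.sum(axis=1, keepdims=True)

# The ratio P(k | ice) / P(k | steam) is what GloVe treats as meaningful:
# much greater than 1 for words related to ice, much less than 1 for words
# related to steam, and close to 1 for words equally (un)related to both.
ratios = P[0] / P[1]
for k, r in zip(contexts, ratios):
    print(f"P({k}|ice) / P({k}|steam) = {r:.2f}")
```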
If you consider less related terms like water and fashion, the ratio of their probabilities between ice and steam is close to one, indicating that they're just as likely to be found in the context of ice as in the context of steam. What the GloVe researchers did was essentially some reverse engineering. They knew that they wanted embeddings that could be composed like the Word2Vec embeddings, and so they asked, "If we had such embeddings, how would we combine them to get something equivalent to this ratio of probabilities?" Once they had such an expression, they used basic algebra to form their loss function. Here I've depicted the output of the model as a function F of some word vectors, w_i, w_j, and w̃_k, and I've set that equal to the ratio of probabilities, P_ik / P_jk. To get the loss function, they simply subtracted this ratio, and you can tell it's a loss because it's supposed to be zero. The actual definition of what's inside the function F is not something we'll cover here, but I encourage you to read the GloVe paper to find out more.

In practice, both GloVe and Word2Vec are good ways of creating word embeddings. There are some tasks where GloVe seems to perform better, and others where Word2Vec does better. What's more important than choosing between GloVe and Word2Vec embeddings is whether you choose to use pre-trained embeddings or train your own. Remember from the last course that transfer learning works best when the source task, the one the model was originally trained on, is similar to the target task, the one you want to use the model for. When the words you're modeling have specialized meanings or rarely occur in common usage, pre-trained embeddings aren't likely to be helpful, and it may be better to train your own from scratch. That's because the versions of GloVe and Word2Vec embeddings that are most widely available are typically trained on sources like Wikipedia. If, however, the words you want to model are similar to those that occur in common usage, then pre-trained GloVe and Word2Vec embeddings can be very beneficial.

Once you've chosen to use pre-trained embeddings, you then have to decide, as we saw in the last lab, whether to make them trainable or not. Recall from the last course that the primary factor to consider when making this decision is dataset size. The larger your dataset is, the less likely it is that letting the embeddings be trainable will result in over-fitting.
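If you do use pre-trained vectors, that trainability decision usually comes down to a single flag. Here is a minimal Keras sketch, assuming you have already built an embedding_matrix array from whichever GloVe or Word2Vec file you downloaded; the placeholder sizes and the random stand-in matrix are illustrative only.

```python
# Sketch: plugging pre-trained vectors into a Keras model and choosing whether
# they stay frozen or continue to train. `embedding_matrix` is assumed to be a
# (vocab_size, embedding_dim) NumPy array built from downloaded GloVe or
# Word2Vec vectors; the random values here are just a stand-in.
import numpy as np
import tensorflow as tf

vocab_size = 10000
embedding_dim = 100
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim))

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # freeze on small datasets; set True when you have lots of data
)

model = tf.keras.Sequential([
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

With a small dataset you would leave trainable=False so the pre-trained vectors stay fixed; with a large dataset, setting it to True lets the embeddings adapt to your task with less risk of over-fitting.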