Remember these diagrams that we used to explain what neural networks are? You could think of the blue dots as customers who buy a particular phone, and the yellow dots as customers who don't. Perhaps the x-axis is the time since the customer last bought a phone, and perhaps the y-axis is the customer's income level. Essentially, people buy the phone if it has been a long time since they last bought one and they're relatively wealthy. So, look at this data. Can you come up with a line that more or less separates these two classes? Sure we can. It may have a little bit of error, the data are not perfectly separable, but a linear model is probably pretty good here. So this is a linear problem: the blue dots and the yellow dots are linearly separable by the green line. Great.

But what if our data looks like this? Can we still use a linear model? Well, it seems that I cannot draw a line that manages to separate the blue dots from the yellow dots. No, wherever I draw my line, there are blue points on either side of it. This data is not linearly separable, so I cannot use a linear model.

Can we be a bit more specific about what we mean by a linear model? Let's add some axes here: x1 is one of our input variables, and x2 is the other. What we mean when we say we cannot use a linear model is that there is no way to linearly combine x1 and x2 to get a single decision boundary that fits the data well. In machine learning terminology, y is the target. Maybe blue equals one and yellow equals zero; those are the labels. The w's and the b are the weights and the bias that we are trying to learn. There is no way that we can modify the w's or the b to fit this decision boundary.

But is there some other way that we can continue to use a linear model? For simplicity, let's put the two axes at the center of the diagram, so that the origin (0, 0) is in the middle. You can obviously get the new x1 and x2 from the old x1 and x2 by simply subtracting a constant, so a linear model in the new coordinate system is still a linear model in the old coordinate system. Now, in this space, let's define a new feature, x3. x3 is going to be a feature cross. Ready? Define the new feature x3 as the product of x1 and x2.

So, how does this help? Take x3, the product of x1 and x2. Where is it positive? Exactly: when x1 and x2 are both positive, or when x1 and x2 are both negative. And where is x3 negative? Exactly: when one of x1 or x2 is negative and the other is positive. So, now we have x3. Can you see how the addition of x3 makes this solvable by a linear model? Now we can find a rule such that the sign of x3 essentially gives us y. Of course, that's just what we did: w1 is zero, w2 is zero, and w3 is one. Essentially, y is the sign of x3. The feature cross made this a linear problem. Pretty neat, don't you think?

In traditional machine learning, feature crosses don't play much of a role, but that's because traditional ML methods were developed for relatively small datasets. Once you have datasets with hundreds of thousands to millions and billions of examples, feature crosses become an extremely useful technique to have in your tool chest. So, recall that we said the layers in a neural network allow you to combine the inputs, and that is part of what makes neural networks so powerful.
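To make that concrete, here is a minimal sketch in Python (my own illustration, not code from the lecture; the data generation is an assumption standing in for the diagram) showing that adding the crossed feature x3 = x1 * x2 lets a purely linear rule with weights w1 = 0, w2 = 0, w3 = 1 recover the labels:

```python
import numpy as np

# Illustrative stand-in for the diagram's data: points in all four quadrants,
# labeled blue (1) when x1 and x2 have the same sign, yellow (0) otherwise.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 500)
x2 = rng.uniform(-1, 1, 500)
y = (x1 * x2 > 0).astype(int)  # the labels

# The feature cross: a new synthetic feature formed by multiplying the inputs.
x3 = x1 * x2

# A linear decision rule over (x1, x2, x3) with w1 = 0, w2 = 0, w3 = 1, b = 0:
# the prediction is simply the sign of x3.
y_pred = (0 * x1 + 0 * x2 + 1 * x3 > 0).astype(int)

print("accuracy:", (y_pred == y).mean())  # 1.0 on this synthetic data
```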
Deep neural networks let you have many layers, and since each layer combines the layers before it, DNNs can model complex multidimensional spaces. Well, feature crosses also let you combine features, and the good thing is that you can get away with a simpler model, a linear model, and that is a good thing; simpler models are a good thing. So, feature crosses are a way to bring non-linear inputs to a linear learner, a linear model. But there is a bit of a caveat. Let me explain it in an intuitive way. Remember that I started this section by moving the axes into the middle of the diagram. Why did I do that?
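As a quick illustration of that last point, a linear learner handling a non-linear problem once the inputs are crossed, here is a small sketch using scikit-learn (my own example, not from the lecture; the dataset is the same illustrative XOR-style data as above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative XOR-style data: label is 1 when x1 and x2 share a sign.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# A linear learner on the raw inputs x1, x2: it cannot do better than chance here.
raw_model = LogisticRegression().fit(X, y)
print("raw features accuracy:", raw_model.score(X, y))  # roughly 0.5

# The same linear learner after adding the crossed feature x3 = x1 * x2.
X_crossed = np.column_stack([X, X[:, 0] * X[:, 1]])
crossed_model = LogisticRegression().fit(X_crossed, y)
print("with feature cross accuracy:", crossed_model.score(X_crossed, y))  # close to 1.0
```

With only the raw inputs, no choice of weights and bias can separate the quadrants, so the learner hovers around chance; once x3 is added, a single positive weight on the crossed feature is essentially all it needs.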