Here's our list. Good feature columns must be related to the objective, known at prediction time, numeric with a meaningful magnitude, and have enough examples present; and lastly, you're going to bring your own human insight to the problem. First up, a good feature needs to be related to what you're actually predicting. You need some reasonable hypothesis for why a particular feature value might matter for this particular problem. You can't just throw arbitrary data in there and hope there's some relationship somewhere for your model to figure out, because the larger your dataset is, the more likely it is that there are lots of spurious or strange correlations your model is going to learn. Let's take a look at this. What are the good features shown here for horses? Well, it's a trick question. If you said it depends on what you're predicting, you're exactly right; I didn't tell you what objective we're after. If the objective is to find what features make for a good racehorse, you might go with the data points on breed and age. Does the color of the horse's eyes really matter that much for racing? However, if the objective was to determine whether certain horses are more predisposed to eye disease, eye color may indeed be a valid feature. The point here is that you can't look at your feature columns in isolation and say whether one is a good feature. It all depends on what you're trying to model, on what your objective ultimately is. Number two: you need to know the value at the time you're doing the prediction. Remember, the whole reason to build an ML model is so that you can predict with it. If you can't predict with it, there's no point in building and training it. A common mistake you're going to see a lot out there is to just look at the data warehouse you have, take all of that data, all the related fields, and throw them all into a model.
If you take all these fields and just throw them into an ML model, what's going to happen when you go to predict with it? Well, at prediction time, you may discover a problem. Your warehouse had all kinds of good historical sales data, and it's all perfectly clean; that's fine, use it as input for your model. Say you want to use how many items were sold on the previous day; that's now an input to your model. But here's the tricky part: it turns out that daily sales data actually comes into your system a month later. It takes some time for the information to come in from your stores, and then there's a delay in collecting and processing that data. Your data warehouse has this information because somebody went through all the trouble of taking the data, joining all the tables, and putting it all in there. But at prediction time, in real time, you don't have that data. Now, the third key aspect of a good feature is that all your features have to be numeric, and they have to have a meaningful magnitude. "Why is that?" you ask. Well, ML models are simply adding, multiplying, and weighing machines. When it's training, your model is just doing arithmetic operations, computing trigonometric and algebraic functions behind the scenes, on your input variables. So your inputs need to be numbers, and, more subtly, their magnitudes need to have a useful meaning: a value of two should really mean twice as much as a value of one. Let's do a quick example. Here we're trying to predict the number of promo coupons that are going to be used, and we're going to look at the different features of that promotional coupon. First up is the discount percentage, like 10 percent off, 20 percent off, etc. Is that numeric? Sure, absolutely, it's a number. Is the magnitude meaningful? Yeah, in this case, absolutely: a 20 percent off coupon is worth twice as much as a 10 percent off coupon. Not a problem.
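Before we go further with numeric features, the prediction-time pitfall from point number two is worth sketching in code. This is a minimal, hypothetical illustration (the thirty-day lag, the dates, and the sales figures are all made up) of why a feature that exists in the warehouse may simply not be there yet when you serve a prediction:

```python
from datetime import date, timedelta

# Hypothetical: daily sales land in the warehouse about a month after
# they happen, so a model served today cannot see yesterday's sales.
REPORTING_LAG_DAYS = 30

# Toy warehouse table: sales keyed by the day they occurred.
warehouse_sales = {
    date(2024, 1, 1): 120,
    date(2024, 1, 2): 135,
    date(2024, 2, 10): 150,
}

def available_sales(sale_day: date, prediction_day: date):
    """Return the sales figure only if it would actually have arrived
    in the warehouse by prediction_day; otherwise None (unavailable)."""
    if sale_day + timedelta(days=REPORTING_LAG_DAYS) <= prediction_day:
        return warehouse_sales.get(sale_day)
    return None

today = date(2024, 2, 15)
# "Yesterday's" sales look fine in the historical warehouse, but at
# prediction time they are still inside the reporting lag window.
print(available_sales(date(2024, 2, 10), today))  # None: not arrived yet
print(available_sales(date(2024, 1, 1), today))   # 120: old enough to exist
```

The point of the sketch: your training pipeline should compute each feature the same way, using only data that would have been available at that moment, or your model will train on information it can never see in production.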
The discount percentage is a perfect example of a numeric input. What about the size of the coupon? Say I define it as 4 square centimeters, 24 square centimeters, and 48 square centimeters. Is that numeric? Yeah, sure. But is a 24-square-centimeter coupon six times as valuable, or six times more visible, than one that's 4 square centimeters? It's numeric, but it's unclear whether that magnitude is really meaningful. Now, if this were an ad you're placing, the size of a banner ad, larger ads are better, and you could argue that the magnitude makes sense. But if it's a physical coupon, something that goes out in, say, the newspaper, then you have to wonder whether a 48-square-centimeter coupon really is twice as good as a 24-square-centimeter one. Now, let's change the problem a little. Suppose we define the size of the coupon as small, medium, and large. At that point, are small, medium, and large numeric? No, not at all. Now, I'm not saying you can't have categorical variables as inputs to your models. You can, but you can't use small, medium, and large directly; we have to do something smart with them, and we'll take a look at how to do that shortly. Let's go with the font of an advertisement: Arial 18, Times New Roman 24. Is this numeric just because it has numbers in it? No. Well, how do you convert something like Times New Roman to numeric? You could say Arial is number one, Times New Roman is number two, Roboto is number three, and Comic Sans is number four. But that's just a number code; those numbers don't have meaningful magnitudes. If we said Arial is one and Times New Roman is two, Times New Roman is not twice as good as Arial. The meaningful-magnitude part is really important. How about the color of the coupon: red, black, blue? Again, these aren't numeric values, and they don't have meaningful magnitudes. We can come up with RGB values to make the colors numbers.
But again, they're not going to be meaningful numerically. If I subtract two colors and the difference between them is three, and I subtract two other colors and that difference is also three, are those differences the same? Are they commensurate? No, and that's the problem with magnitude. How about item category: one for dairy, two for deli, three for canned goods? As you've seen before, these are categorical, not numeric. Again, I'm not saying you can't use non-numeric values; you can, but you have to do something to them first, and we'll look at what that is. As an example, suppose you have words in a natural language processing system. One thing you can do to make words numeric is run something called Word2Vec. It's a very standard technique: you take all of your words and apply this technique to turn each word into a numerical vector, which, as you know, has a magnitude. And at the end of Word2Vec, the vectors are such that if you take the vector for man and the vector for woman and subtract them, the difference you get is going to be very similar to the difference between the vector for king and the vector for queen. That's what Word2Vec does. Changing an input variable that's not numeric into something numeric is not a simple matter; it's a little bit of work. Sure, you could just go ahead and throw some random encoding in there, but your ML model is not going to be as good as if you started with a vector encoding that understands the context of things like male and female, man and woman, king and queen. That's what we're talking about when we say numeric features with meaningful magnitudes: they have to be useful so you can do arithmetic operations on them during your ML processing.
Point number four: you need to have enough examples of that feature value in your dataset. A good starting point for experimentation is to have at least five examples of any value before you use it in your model; at least five examples of a value before you use it in training, validation, and so on. Going back to our promo code example: suppose you want to build an ML model on promotion codes, and one of them gave 10 percent off. You may well have a lot of examples of 10-percent-off coupons in your training dataset. But what if you gave a few users a one-time discount code of 87 percent off? Do you think you'll have enough instances of an 87 percent discount code in your dataset for your model to learn from? Likely not. You want to avoid having values of which you don't have enough examples to learn from. Notice I'm not saying that you need at least five categories, like 10 percent off, 20 percent off, 30 percent off. And I'm not saying that you need at least five total records in a column. I'm saying that for every value of a particular column, you need at least five examples: in this case, at least five instances of that 87-percent-off discount code before we even consider using it for ML. Last but not least, bring your human insight to the problem. Recall how we verbalized a reason behind each of our answers about what makes a good feature. You need subject matter expertise and a curious mind to think of all the ways you could construe a data field as a feature. Remember that feature engineering is not done in a vacuum: after you train your first model, you can always come back and add or remove features for model number two.
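The five-examples rule of thumb is easy to check mechanically. Here's a minimal sketch, with a hypothetical column of discount percentages, that counts how often each value appears and flags the ones too rare to learn from:

```python
from collections import Counter

# Hypothetical training column of discount percentages:
# plenty of 10/20/30% coupons, but only two 87%-off one-offs.
discounts = [10] * 40 + [20] * 25 + [30] * 12 + [87] * 2

MIN_EXAMPLES = 5  # the "at least five examples per value" heuristic

counts = Counter(discounts)
usable = {value for value, n in counts.items() if n >= MIN_EXAMPLES}
too_rare = {value for value, n in counts.items() if n < MIN_EXAMPLES}

print(sorted(usable))    # [10, 20, 30]
print(sorted(too_rare))  # [87]
```

Values in the rare set aren't necessarily discarded outright; a common follow-up is to bucket them into a catch-all category such as "other" so the model still sees that a rare discount occurred without trying to learn from two examples.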