Okay, we're going to take a slight diversion here into generating a random dataset, which we're going to do some machine learning on in a minute. You're thinking, why don't I just go and download a real dataset? And you could, but having our own dataset, with our own degrees of non-linearity and definite correlations between the fields, gives us a lot of control when we're experimenting. I hope you'll see what I mean.
So we'll just initialize H2O, as we normally do. I'm going to set a random seed so I can reproduce this dataset later, and I'm going to set N to be how many random records I'm going to create. Maybe I don't want to be experimenting with big data just yet; later on, maybe I want to create a million records using the same algorithm and the same distributions, and see how much more trouble the models have learning.
Just going to define the four blood types. Yeah, where am I going with this? Now, create our data frame, which so far has just an id column running from 1 to 1,000. Good time to point out: when you're doing machine learning, if you spot an id column, try to remove it before you train your model, because you know it's meaningless. What I'm going to do is create a blood type based on the ID. If you're not sure what that looks like, let me just have a look at the start of what we have so far. So the first person is blood type O, the next AB, then B, then A, then O again, AB, and so on all the way through. I don't like that idea. I'm going to change my mind, because blood types A and O are apparently a lot more common. So I'm just going to redo that step.
Next. And blood type, by the way, is a red herring. When you're creating your models, if your model says blood type is significant, you have a problem: it has overfitted. I'm creating a dataset of people, as you've realized. I'm going to give them an age between 18 and 65. And now I'm going to randomly choose how healthily they eat, on a scale of zero to nine, with most of them in the middle, at a five or a six.
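If you want to follow along without the screen, the steps so far can be sketched in plain Python. This is only a sketch under stated assumptions: the seed value, the blood-type weights, and the mean and spread of the healthy-eating score are made up for illustration, and it uses the standard `random` module rather than a data frame or any H2O call.

```python
import random

SEED = 123   # assumed value; any fixed seed makes the data reproducible
N = 1000     # how many random records to create

BLOOD_TYPES = ["A", "O", "AB", "B"]
# Assumed weights that make A and O more common; the walkthrough gives no figures.
BLOOD_WEIGHTS = [0.40, 0.40, 0.05, 0.15]


def clip(value, low, high):
    """Force value into the range low..high."""
    return max(low, min(high, value))


def make_people(n, seed):
    """Build n records with id, blood type, age and healthy-eating score."""
    rng = random.Random(seed)
    people = []
    for person_id in range(1, n + 1):
        people.append({
            "id": person_id,
            # Chosen independently of everything else: a red herring.
            "blood_type": rng.choices(BLOOD_TYPES, weights=BLOOD_WEIGHTS)[0],
            # Uniform age between 18 and 65, deliberately not rounded.
            "age": rng.uniform(18, 65),
            # Bell-shaped score centred on 5 (assumed mean 5, sd 2),
            # rounded to a whole number and kept within 0..9.
            "healthy_eating": clip(round(rng.gauss(5, 2)), 0, 9),
        })
    return people


people = make_people(N, SEED)
print(people[:3])
```

Changing `N` to a million reuses the same distributions, and calling `make_people` again with the same seed returns the same records.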
These two lines make sure I don't have values smaller than zero or larger than nine. And I'm then going to do something very similar for the active lifestyle. But spot the non-linearity: people under 30 get a bonus of one. And if you look here, let's see, yes, it has shifted the distribution to the right, but in a non-linear way: only for the people in our dataset under 30.
Now I'm going to create an income for everybody, and this is what we're going to be learning to predict in the later videos. I'm giving them a base salary of 20,000 plus their age times three, squared, which takes it from about 22,900 at age 18 to about 58,000 at age 65. I'm then giving them a bonus if they eat healthily because, I don't know, they turn up at work more often, they've got better concentration. But I'm making this a real effect, so pay attention. Then, another real effect: I'm reducing their salary if they have an active lifestyle, because perhaps they take more holidays, perhaps they get injured more often.
Then, this is the noise. It's the first type of noise I'm introducing: everybody gets a random bonus of $0 to $5,000 on their salary, completely independent of everything else, even blood type. This is the second place I'm introducing noise, and it's a bit more subtle, because I'm rounding off to the nearest hundred. If we stopped at this point, if we just used income as it is at the end of here, theoretically our model could learn the underlying reasons perfectly. But as soon as we start clipping the data, we've introduced a step, which is noise, and then we've introduced a very obvious kind of noise here anyway.
Finally, a simple step: I'm just going to import it into H2O, but I don't want to call it d, I want to call it people. So I'm specifying a destination frame explicitly. Okay. What's been returned is the information H2O has on the data frame we've just imported, and it's showing me the first six people. If the data frame was already on the server and not in our own session, this line gets a handle to it.
It doesn't download the data, it just gets a reference to it so we can build models on it or, as in this line, get a summary of what we have, just to check what we've created. ID runs from 1 to 1,000, the blood types are as we expect, average age is 41, running from 18.02 to 64.97. I didn't round the ages. We'll go with it. Healthy eating runs from zero to nine, as does active lifestyle, but note the slightly higher mean: same median, higher mean. And income ranges from 22,600 to 64,600. Perfect.
Just a warning when you're creating these random datasets: because we're not looking at saving data until next week, don't shut down your client, your R client or your Python client, because then H2O will shut down, your data will be lost, and you'll have to recreate it.
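Going back to the active-lifestyle and income steps, here is a rough Python sketch of how the real effects and the two kinds of noise could be layered. The base salary formula, the under-30 bonus, the $0 to $5,000 random bonus and the rounding to the nearest hundred come from the walkthrough; the seed, the per-point amounts for healthy eating and active lifestyle, and the mean and spread of the scores are assumptions for illustration only. It uses the standard `random` module and does not upload anything to H2O.

```python
import random


def clip(value, low, high):
    """Force value into the range low..high."""
    return max(low, min(high, value))


def base_income(age):
    """Base salary of 20,000 plus (age times three) squared."""
    return 20000 + (age * 3) ** 2


def make_person(rng, person_id):
    age = rng.uniform(18, 65)
    healthy_eating = clip(round(rng.gauss(5, 2)), 0, 9)  # assumed mean 5, sd 2

    active_lifestyle = round(rng.gauss(5, 2))            # assumed mean 5, sd 2
    if age < 30:
        active_lifestyle += 1   # the non-linearity: under-30s get a bonus of one
    active_lifestyle = clip(active_lifestyle, 0, 9)

    income = base_income(age)
    income += healthy_eating * 500     # real effect; 500 per point is an assumed amount
    income -= active_lifestyle * 300   # real effect; 300 per point is an assumed amount
    income += rng.uniform(0, 5000)     # first noise: independent random bonus
    income = round(income, -2)         # second noise: round to the nearest hundred

    return {
        "id": person_id,
        "age": age,
        "healthy_eating": healthy_eating,
        "active_lifestyle": active_lifestyle,
        "income": income,
    }


rng = random.Random(123)  # assumed seed
people = [make_person(rng, i) for i in range(1, 1001)]

# A quick check in the spirit of the summary step.
incomes = [p["income"] for p in people]
print(min(incomes), max(incomes))
```

Here `base_income(18)` is 22,916 and `base_income(65)` is 58,025, which is where the range of roughly 22,900 to 58,000 comes from before the effects and the noise are added.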