Many of our data science projects will eventually involve the application of machine learning, and for those machine learning algorithms to be effective, we often need to make adjustments to our data set and to the variables within it. The reason is that different machine learning algorithms are sensitive to, or influenced by, certain properties of the data. As we analyze our information, we can identify the adjustments that may be necessary for the data set to be most useful to our particular machine learning problem.

For example, our data may have a skewed distribution, and we may have identified that through a visualization. A skewed distribution may impact certain machine learning algorithms while having less impact on others, so we may determine that there's a need to adjust the distribution to make the data more useful (a sketch of one common approach follows below). We could also have a situation where some of the variables being considered are on a much different scale than other variables. In that case, we might need to perform some type of scaling or normalization on those variables to bring them into a similar range so that, again, the machine learning algorithms can learn more effectively. There are other situations that can negatively affect an algorithm's ability to learn as well. For example, if there are a lot of outliers in a data set, that can also present a problem.

At this stage, where we're applying data pre-processing, we're trying to adjust the data so that it will be more useful for our machine learning processes later on. The goal is to apply a few different techniques to make sure that our data is ready. Now usually, you'll be starting with a data set that has already gone through at least one iteration of your data transformation, but data transformation is iterative in nature: we go back through the transformation process as we learn more about the data and as we need to adjust it for a specific task.

There are several techniques we might apply at this point, where we've already got a data set that's been brought into a common format and the data has already been massaged so that it's at least consistent, and now we're looking at the specific adjustments necessary to support our algorithms. One, of course, is handling missing values or null values in the data set. These may actually be handled earlier in the process: if there are features that are missing a large number of values, then dropping those features earlier on may be the better approach. But if there are still any records or features with missing values, we need to deal with them now, and we'll be looking at that in an upcoming module. We also may need to scale variables, bringing them into a similar range, because our mathematical machine learning algorithms often assign more weight to larger values; bringing everything into a similar scale makes those algorithms more useful and able to make skillful predictions, estimations, and classifications.
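To make the skew idea concrete, here's a minimal sketch of one common adjustment, a log transform, using pandas and NumPy. The "income" column and its values are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# A hypothetical right-skewed feature: a few large values pull the tail out.
df = pd.DataFrame({"income": [20_000, 25_000, 30_000, 45_000, 60_000, 250_000]})

print(df["income"].skew())  # strongly positive -> right-skewed

# log1p computes log(1 + x), which handles zeros safely; compressing
# the long right tail brings the distribution closer to symmetric.
df["income_log"] = np.log1p(df["income"])
print(df["income_log"].skew())  # much closer to 0
```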
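For the scaling case, here's a minimal sketch of two common approaches using scikit-learn; the "age" and "salary" features are assumed just to show two variables on very different scales:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],                          # roughly 0-100
    "salary": [40_000, 52_000, 75_000, 90_000, 120_000],  # tens of thousands
})

# Min-max scaling maps each feature into the [0, 1] range.
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization centers each feature at 0 with unit variance.
df_std = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax.round(2))
print(df_std.round(2))
```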
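And for outliers, one conventional way to flag them is the interquartile-range rule; this sketch assumes a single numeric series, and the 1.5 multiplier is simply the customary default, not anything prescribed here:

```python
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 98])  # 98 is an obvious outlier

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # the flagged values
```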
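As for missing values, here's a minimal sketch of the two cases just described: dropping a feature that is mostly missing, then imputing the remaining gaps. The column names and the 50% threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 62],
    "salary": [40_000, 52_000, np.nan, 90_000, 120_000],
    "notes":  [np.nan, np.nan, np.nan, "ok", np.nan],  # mostly missing
})

# Drop any feature missing more than half of its values.
df = df.loc[:, df.isna().mean() <= 0.5]

# Fill the remaining gaps with each column's median.
df = df.fillna(df.median(numeric_only=True))
print(df)
```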
We also may, through our analysis, determine that there is very little correlation between a feature variable and the target variable. If there's not much correlation between them, the feature variable is not influencing the outcome, and so it would probably be better to eliminate it from the data set so that it doesn't just add noise (a sketch of this check follows below). We may also become involved in engineering new features. Even though we probably have a lot of features already, they may not be in a state that is ideal for machine learning. We may need to encode some of the information in existing features into new features that are more useful, again, for our machine learning practices. We may take a feature that contains a lot of information and split it out into separate fields, or we may use other encoding techniques to better represent the information, perhaps creating a calculated feature that will be more useful. By doing all of this data pre-processing, the goal, of course, is to make the final adjustments that ensure the data is ready for machine learning.
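Here's a minimal sketch of that correlation check with pandas; the feature names, data, and the 0.1 cutoff are all hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "sqft":  [800, 1200, 1500, 1800, 2400],
    "rooms": [2, 3, 3, 4, 5],
    "noise": [3, 8, 2, 9, 4],            # unrelated to price
    "price": [150, 210, 260, 300, 400],  # the target variable
})

# Absolute correlation of each feature with the target.
corr = df.corr()["price"].drop("price").abs()
print(corr)

# Drop features below an arbitrary 0.1 threshold; "noise" goes,
# since it only adds noise to the learning problem.
weak = corr[corr < 0.1].index
df = df.drop(columns=weak)
```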
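And to illustrate feature engineering, here's a minimal sketch of the three ideas above: splitting an information-dense field, one-hot encoding a category, and deriving a calculated feature. All of the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "move_in": ["2021-03-15", "2022-11-02"],
    "unit_type": ["studio", "loft"],
    "rent": [1500, 2200],
    "sqft": [450, 800],
})

# Split one information-dense field into separate features.
dates = pd.to_datetime(df["move_in"])
df["move_in_year"] = dates.dt.year
df["move_in_month"] = dates.dt.month

# One-hot encode a categorical feature into indicator columns.
df = pd.get_dummies(df, columns=["unit_type"])

# Create a calculated feature that may be more predictive
# than either input on its own.
df["rent_per_sqft"] = df["rent"] / df["sqft"]
print(df)
```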