Python programmers will often suggest that there are many ways the language can be used to solve a particular problem, but that some are more appropriate than others. The best solutions are celebrated as idiomatic Python, and there are lots of great examples on Stack Overflow and other websites. As a sub-language within Python, pandas has its own set of idioms. We've alluded to some of these already, such as using vectorization wherever possible and not using iterative loops if you don't need to. Several developers and users within the pandas community have used the term pandorable for these idioms, and I think it's a great term. So I want to share with you a couple of key features of how you can make your code more pandorable.

Let's start by bringing in our data processing libraries, pandas and NumPy. We're going to bring in some timing functionality too, from the timeit module, because I want to demonstrate something about idiomatic code. Then let's look at some census data from the US, which lives in datasets/census.csv, and take the head of it to remind ourselves what it looks like.

The first of the pandas idioms I want to talk about is called method chaining. The general idea behind method chaining is that every method on an object returns a reference to that object. The beauty of this is that you can condense many different operations on a DataFrame into one line, or at least one statement, of code. Here's a pandorable way to write code with method chaining. In this code, I'm going to pull out the state and county names as a multi-index, and I'm going to do so only for data which has a summary level of 50, which in this dataset is county-level data. I'm going to rename a column too, just to make it a bit more readable. So I call df.where with the condition that the summary level equals 50, then I drop the NaNs that come back from that. You see, where just returned a DataFrame, so I can call .dropna() directly on it. Then I set the index, and I'm going to use a multi-index here: I want the state name and the county name to be the index. Finally, I rename a column, changing the 2010 estimates base to something a little more readable.

Let's walk through this. First, we use the where function on the DataFrame and pass in a Boolean mask which is only true for those rows where the summary level is equal to 50. This indicates that, in our source data, the data is summarized at the county level. With the result of the where function evaluated, we drop missing values; remember that where doesn't drop missing values by default. Then we set an index on the result of that, in this case the state name followed by the county name. Finally, we rename a column to make it more readable. Note that instead of writing this all on one line, as I could have done, I began the statement with a parenthesis, which tells Python that I'm going to expand the statement over multiple lines for readability.
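Here's a minimal sketch of that chained statement. I'm assuming the file sits at datasets/census.csv and uses the column names from the US Census county population estimates (SUMLEV, STNAME, CTYNAME, ESTIMATESBASE2010); adjust those names if your copy of the data differs.

```python
import pandas as pd
import numpy as np
import timeit

df = pd.read_csv('datasets/census.csv')
df.head()

# One chained statement: filter to county-level rows, drop the NaN rows
# that where() leaves behind, build a (state, county) multi-index, and
# rename the estimates column to something more readable.
(df.where(df['SUMLEV'] == 50)
   .dropna()
   .set_index(['STNAME', 'CTYNAME'])
   .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
```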
Here's a more traditional, non-pandorable way of writing this. There's nothing wrong with the code in a functional sense, and as a newcomer to the language you might even understand it better; it's just not considered as pandorable as the first example. First, we create a new DataFrame from the original by indexing it with the condition that the summary level equals 50. Here I'm using the overloaded indexing operator, which drops NaNs. So I'm actually doing a couple of things: I'm creating that Boolean array and then using it to project just certain rows out of the DataFrame, and the NaN values are being dropped automatically, because that's the shortcut that's been put in place. Then I update the DataFrame to have a new index, using inplace=True to do this in place: I call df.set_index with the columns I'm interested in, and inplace=True tells the DataFrame to just modify itself. Then I rename the columns. This looks pretty much the same, and we get a similar result.

Now, the key with any good idiom is to understand when it isn't helping you. In this case, we can actually time both methods to see which one runs faster. We put each approach into a function and pass that function to timeit to measure the execution time; the number parameter lets us choose how many times we want to run the function, and here I'll set it to 10. So let's write a wrapper for our first approach: a function that declares the DataFrame as global and contains our pandorable code, returning the result. I'll read in a fresh copy of the dataset, then call timeit.timeit, passing it the function and the number of times I want it run. Now let's test the second approach. Notice that we're using the global variable df inside the function; however, changing a global variable inside a function would modify it in the global scope too, and we don't want that to happen here. So for selecting only the summary-level-50 records, I create a new DataFrame for those records. Then we read in a fresh dataset again and run the timing, again with 10 runs. Here's a sketch of the whole comparison.
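This is how the timing might look, under the same column-name assumptions as before; the function names first_approach and second_approach are just mine for illustration.

```python
def first_approach():
    global df
    # the pandorable, chained version
    return (df.where(df['SUMLEV'] == 50)
              .dropna()
              .set_index(['STNAME', 'CTYNAME'])
              .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

def second_approach():
    global df
    # boolean indexing builds a new DataFrame, so the global df is untouched
    new_df = df[df['SUMLEV'] == 50]
    new_df.set_index(['STNAME', 'CTYNAME'], inplace=True)
    return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

# read a fresh copy of the data before each timing run
df = pd.read_csv('datasets/census.csv')
print(timeit.timeit(first_approach, number=10))

df = pd.read_csv('datasets/census.csv')
print(timeit.timeit(second_approach, number=10))
```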
As you can see, the second approach is much faster, so this particular example is a classic time-versus-readability trade-off. You'll see lots of examples on Stack Overflow and in documentation of people using method chaining in their pandas code, so I think it's really important for you to be able to read and understand the syntax, and it's worth your time to investigate it. But keep in mind that following what appear to be stylistic idioms might have performance costs, and you need to consider those as well; it really depends on the scope of the data cleaning you're doing.

Here's another pandas idiom. Python has a wonderful function called map, which is a basis for functional programming in the language. When you use map in Python, you pass it some function you want called and some iterable, like a list, that you want the function applied to. The function is then called against each item in the iterable, and the result is a list of all of the evaluations of that function. Pandas has something similar called applymap. With applymap, you provide some function which should operate on each cell of the DataFrame, and the return set is itself a DataFrame. Now, I think applymap is fine, but I actually rarely use it. Instead, I find myself often wanting to map across all of the rows in the DataFrame, not just the cells, and pandas has a function that I use quite heavily for that, called apply. So let's take a look at an example using our census DataFrame. In this DataFrame, we have five columns of population estimates, with each column corresponding to one year of estimates. It's quite reasonable to want to create new columns for the minimum or maximum values, and the apply function is an easy way to do this.

First we need to write a function which takes in a particular row of data, finds the minimum and maximum values, and returns a new row of data. We'll call this function min_max. This is pretty straightforward: we create a small slice of the row by projecting the population columns, then use the NumPy min and max functions and create a new Series whose labels represent the new values we want. So in min_max, which takes a row, we project just the population estimates from the row into data, and then return a new Series with two values, min and max, computed with np.min and np.max over data. Then we just need to call apply on the DataFrame. Apply takes the function and the axis on which to operate as parameters. Now, we have to be a bit careful here. We've talked about axis 0 being the rows of the DataFrame in the past, but this parameter is really the axis of the index to use. So to apply across all rows, which means applying to all of the columns in each row, you pass an axis equal to 1, or equal to the word 'columns' itself. So we call df.apply, pass in our min_max function, set the axis equal to 'columns', and look at the head of the result.

Of course, there's no need to limit yourself to returning a new Series object. If you're doing this as part of data cleaning, you're likely to find yourself wanting to add new data to the existing DataFrame. In that case, you can just take the row values and add in new columns indicating the max and min scores. This is a regular part of my workflow when bringing in data and building summary or descriptive statistics, and it's often used heavily when merging DataFrames. Here's an example with a revised version of min_max: instead of returning a separate Series to display the min and max, we add two new entries to the original row. Again, we project all of the population data from the row, then create a new entry for the max with row['max'] = np.max(data), and a new entry for the min with row['min'] = np.min(data), and return the row. So now we actually have all of our population estimate data in the row, as well as the max and min, and we apply this the same way: df.apply, passing in the function we're interested in and applying it across columns.
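Here's a sketch of both versions. I'm assuming the five estimate columns are named POPESTIMATE2010 through POPESTIMATE2014, matching the census data; swap in the column names from your own copy if they differ.

```python
def min_max(row):
    # project just the population estimate columns out of the row
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014']]
    # return a brand-new Series holding just the two summary values
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

df.apply(min_max, axis='columns').head()

# revised version: keep the whole row and tack the two new entries onto it
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

df.apply(min_max, axis='columns').head()
```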
Apply is an extremely important tool in your toolkit. The reason I introduced it here with such a large function definition is that you rarely see it used that way; instead, you typically see it used with lambdas. To get the most out of this discussion, and the discussions you'll see online, you're going to need to know how to read lambdas, and you can imagine how you might chain several apply calls with lambdas together to create a readable yet succinct data manipulation script. As a one-line example, let's calculate the max of the columns using the apply function. I'll bring in the list of population estimate columns again, and then apply a lambda across the DataFrame: df.apply, passing in lambda x: np.max(x[rows]), with axis equal to 1. I could have said axis='columns'; 1 and 'columns' are synonymous, just as 0 and 'rows' are. Then we take the head of that. This is something you'll commonly see in Stack Overflow examples, in messages, and even in the official documentation. If you don't remember lambdas, just pause the video for a moment and look up the syntax. A lambda is just an unnamed function in Python; in this case, it takes a single parameter x and returns a single value, here the maximum over all of the columns associated with row x.

The beauty of the apply function is that it allows flexibility in doing whatever manipulation you desire, because the function you pass into apply can be any customized function you want. So let's say we want to divide the states into four regions: Northeast, Midwest, South, and West. We can write a customized function that returns the region based on the state, using state-region information we might have looked up on Wikipedia. So I'll write a function, get_state_region. For the Northeast, I'll have a nice big list of states in the northeast; for the Midwest, of course the most important region, a bunch of states, all very important, especially that third one there; then all the states we want in the South; and then the states we want in the West. Then we just write a little chain of if/elifs: if x is in the Northeast list, return 'Northeast'; if x is in the Midwest list, return 'Midwest'; and the same for the South and the West. You can imagine there are many ways we could have written that function.

So now we have a customized function. Let's say we want to create a new column called state_region, which shows each state's region; we can use the customized function and the apply function to do so. The customized function is supposed to work on the state name column, STNAME. So we create a new column in our DataFrame called state_region and make it equal to the result of applying our function to the projected state name column. We're projecting a single column, so this is a Series, but a Series also has an .apply function. We call .apply and pass in a lambda that just takes whatever value it sees and calls get_state_region on it.
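A sketch of both of these. The region lists below are my assumption, based on the standard US Census Bureau regions; the transcript doesn't spell out the exact lists, so treat them as illustrative.

```python
rows = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014']
# the one-line lambda version of the max calculation
df.apply(lambda x: np.max(x[rows]), axis=1).head()

def get_state_region(x):
    # region membership assumed from the standard US Census Bureau regions
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire',
                 'Rhode Island', 'Vermont', 'New York', 'New Jersey',
                 'Pennsylvania']
    midwest = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa',
               'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota',
               'South Dakota']
    south = ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina',
             'South Carolina', 'Virginia', 'District of Columbia',
             'West Virginia', 'Alabama', 'Kentucky', 'Mississippi',
             'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas']
    west = ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico',
            'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon',
            'Washington']
    if x in northeast:
        return 'Northeast'
    elif x in midwest:
        return 'Midwest'
    elif x in south:
        return 'South'
    elif x in west:
        return 'West'

# STNAME projected alone is a Series, and Series has .apply() too
df['state_region'] = df['STNAME'].apply(lambda x: get_state_region(x))
df[['STNAME', 'state_region']].head()
```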
So even though it was a Series we were working on, we assigned the result to a projection of the DataFrame, which means we've still got our full DataFrame, now with the new column. Taking a look at the state name and state region columns together, we can see that we have both the state name and its region throughout. So those are a couple of pandas idioms, but there are many more that I haven't talked about here, and here's an unofficial assignment for you: go look at some of the top-ranked pandas questions on Stack Overflow, and look at how some of the more experienced authors answer those questions. You're going to learn a lot from that. Do you see any interesting patterns? Feel free to share them with others in the class and with me, so that we can all learn more about what idiomatic pandas looks like.