0:00

In this video on visualizing numerical data, we will discuss scatter plots for

paired data and other visualizations for

describing distributions of numerical variables.

The data come from Gapminder, which pulls this information from a variety of data

sources.

We will be working with two numerical variables:

income per person, in US dollars, and life expectancy, in years, for

the year 2012.

Each observation in this data set is a country.

The data set contains data from most but

not all countries, since this information wasn't available for certain countries.

A common tool for

visualizing the relationship between two numerical variables is a scatter plot.

To identify the explanatory variable in a pair of variables, we identify which of

the two is suspected of affecting the other and plan an appropriate analysis.

Since we might suspect that the economic wealth of a country might affect

the average life expectancy of its people, we have set up our analysis with

income as the explanatory and life expectancy as the response variable.

Generally, in a scatter plot, we place the explanatory variable on the x axis and

the response variable on the y axis.

It's very important to note that labeling variables as explanatory and response

does not guarantee that the relationship between the two is actually causal,

even if an association between the two variables is identified.

We use these labels only to keep track of which variable we suspect

affects the other.

In fact, since these data are observational and

do not come from a randomized controlled experiment, we know that we can only talk

about correlation and not causation between the two variables.

So what is the relationship between these two variables?

The best way to answer this question is to visualize a line or

a curve going through a cloud of the data.

So here I'm drawing a curve that first shows

an increase in life expectancy as income increases, and

then the relationship levels off such that countries with income levels above

a certain point still have roughly 80 to 85 years of average life expectancy.

2:39

The shape of the relationship.

Is it linear, or does it follow some other form?

The strength of the relationship.

Is the relationship strong,

indicated by little scatter,

or weak, indicated by lots of scatter?

And any potential outliers.

3:08

Let's take a closer look at the outliers.

Some of them have pretty high income levels.

These include Luxembourg, a rich country with a small population and

a high income per person level;

Macao, a special administrative region in China; and

Qatar, a country with a small population and lots of oil.

Another potential outlier is Nepal, where the life expectancy is considerably

higher than what would be expected for the low income level compared to others.

These are countries that we would indeed expect to behave differently than

the majority of the countries.

So it's not surprising that they stand out from the rest.

One naive way of dealing with outliers in data analysis is to immediately

exclude them.

But we're calling that approach naive because it's often not the right approach.

This is a good example of when the outliers might be very interesting

cases.

And handling them with careful consideration of the research question and

other associated variables is important.

Now, let's take a look at the distributions of the variables,

individually.

One good way of visualizing the distribution of a numerical variable

is a histogram.

In a histogram, data are binned into intervals and

the heights of the bars represent the number of cases that fall into each interval.

In other words, a histogram provides a view of the data density:

higher bars represent where data are relatively more common.

For example, we can see that the majority of the countries have average life

expectancies between 65 and 85 years.
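The binning idea described above can be sketched in a few lines of Python. This is only an illustration: the values are made up (they are not the Gapminder data), and `bin_counts` is a hypothetical helper name, not something from the lecture.

```python
from collections import Counter

def bin_counts(values, bin_width, start):
    """Count how many values fall into each interval [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        # Index of the interval the value falls into, then that interval's left edge.
        index = int((v - start) // bin_width)
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up life expectancies in years, for illustration only.
life_expectancy = [48, 55, 61, 66, 68, 71, 72, 73, 74, 75, 76, 77, 79, 80, 82, 83]

counts = bin_counts(life_expectancy, bin_width=10, start=40)
for left, n in counts.items():
    # Each '#' stands for one case, so the row length plays the role of bar height.
    print(f"{left}-{left + 10}: {'#' * n}")
```

The longest row marks the interval where the made-up values are most common, which is the same reading we give to the tallest bar in a histogram.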

Histograms are also very useful for identifying shapes of distributions.

In this case, the distribution of life expectancies

appears to be left skewed, which is expected

due to the leveling off of life expectancies we've identified earlier.

There's a physiological limit to how long people live.

And in most countries, people live up to that limit, but

there are some countries with much lower life expectancies, and fewer and

fewer countries with lower and lower life expectancies,

resulting in a long left tail.

The distribution of income on the other hand is right skewed.

Incomes can't be negative so we have a natural boundary at zero, but

there is no real upper limit to how high incomes can go.

However, as we go higher and higher we have fewer and fewer countries

with such high levels of personal income resulting in a long right tail.

A shared characteristic between these two distributions

is that they're both unimodal.

Let's focus on these statements on skewness and modality for a bit.

5:38

First off, skewness.

Distributions are said to be skewed to the side of the long tail.

In a left skewed distribution, the longer tail is on the left, the negative end.

If no skewness is apparent, then the distribution is said to be symmetric.

And in a right skewed distribution,

the longer tail is on the right, the positive end.
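In this video we judge skewness by eye, but a number can serve as a rough cross-check. The sketch below uses the moment-based skewness coefficient, which is a swapped-in numeric counterpart and not a method from the lecture. A negative value suggests a longer left tail, a positive value a longer right tail, and a value near zero rough symmetry. The three data sets are made up, and `skewness` is a hypothetical helper name.

```python
def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up samples, for illustration only.
left_skewed = [50, 60, 70, 75, 78, 80, 80, 81, 82, 82]   # long tail toward low values
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 10, 25, 60]         # long tail toward high values

for name, data in [("left", left_skewed), ("symmetric", symmetric), ("right", right_skewed)]:
    print(name, round(skewness(data), 2))
```

The sign of the result points to the side of the longer tail, matching the naming convention above.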

As you can see, the best way to assess the shape of distributions is to step back and

imagine a smooth curve outlining the distribution,

instead of focusing on the jagged edges of the bars in the histogram.

6:30

The distribution that you will work with most closely

in an introductory statistics course is unimodal: the normal distribution,

that you may also know as the bell curve.

A bimodal distribution might indicate that there are two distinct groups

in your data.

For example here's a distribution of heights of individuals at a preschool.

The first peak might be the kids and the second might be the teachers.

A uniform distribution means there's no apparent trend in the data:

high and low values of the variable are equally likely to occur.

Here's a distribution of the last digits of a random sample of people's social

security numbers.

As expected, the data show no trend, as it is just as likely to have a social security

number that ends with a zero, as a six or a nine.

7:15

Assessing modality, like shape, is also

best done by imagining a smooth curve outlining the distribution.

Here is a trick: think of the bars of the histogram as wooden blocks and

imagine dropping limp spaghetti over them, and try to imagine how the limp

spaghetti would fall over and between the wooden blocks.

Peaks that are further from each other will likely result

in distinguishable, prominent peaks, and

peaks that are close to each other, like the ones around zero and two, may not.

Identifying the number of modes is not an exact science, and

not one that you should dwell on too much.

Usually all you need to do is to determine whether the distribution is uniform,

unimodal, or something else.

7:57

We should also note that the chosen bin width of the histogram can

alter the story the histogram is telling.

When the bin width is too wide, we might lose interesting details.

When the bin width is too narrow, it might be difficult to get an overall picture of

the distribution.

The ideal bin width depends on the data you're working with.

So you should try playing with it until you're satisfied with the visualization.
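To see how the bin width changes the story, here is a minimal sketch that bins the same made-up heights (in centimeters, loosely in the spirit of the preschool example) with three different widths. The data and the helper name `histogram` are assumptions for illustration only.

```python
def histogram(values, bin_width):
    """Return {left_edge: count} for intervals [left_edge, left_edge + bin_width)."""
    counts = {}
    for v in values:
        left = (v // bin_width) * bin_width  # left edge of the interval containing v
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Made-up heights in cm: one group of children and one group of adults.
heights = [95, 98, 100, 102, 104, 105, 107, 110, 160, 165, 168, 170, 172, 175, 180]

for width in (100, 10, 1):
    bins = histogram(heights, width)
    print(f"bin width {width}: {len(bins)} non-empty bins")
    for left, n in bins.items():
        print(f"  {left:>3}-{left + width:<3} {'#' * n}")
```

With a width of 100 the two groups are no longer visible as separate peaks, with a width of 1 every bar has height one and no overall shape emerges, and a width of 10 shows the two clusters.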

8:38

Yet another visualization technique that is especially useful for

highlighting outliers is a box plot.

A box plot also readily displays the median,

the midpoint of the distribution (this is the thick line inside the box), and

the interquartile range, the width of the box.

According to this box plot, the median life expectancy is roughly 73 years, and

the middle 50% of countries have average life expectancies between 65 and

77 years.

In addition, countries with life expectancies that are below 48 years

are considered to have unusually low life expectancies.
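The lecture does not say which rule the box plot uses to flag unusual values. A common convention, assumed here, is to flag anything more than 1.5 times the interquartile range beyond the quartiles. With the rounded quartiles read off the plot, that gives 65 - 1.5 × (77 - 65) = 47, close to the cutoff of roughly 48 mentioned above. The sketch below applies that convention to made-up values chosen so the quartiles come out at 65, 73, and 77. `box_plot_summary` is a hypothetical helper name.

```python
import statistics

def box_plot_summary(values):
    """Median, quartiles, IQR, and 1.5 * IQR fences (an assumed, common convention)."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values outside the fences would be drawn as individual points on a box plot.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Made-up life expectancies in years, for illustration only.
life_expectancy = [45, 64, 65, 66, 70, 73, 74, 76, 77, 78, 80]

summary = box_plot_summary(life_expectancy)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Different software computes quartiles in slightly different ways, so the exact fences on a real plot may differ a little from a hand calculation like this one.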

A box plot of the income distribution

shows the same right skewed distribution we've identified before.

And the outlying countries with unusually high per person income levels stand out

in this visualization as well.

9:28

One way of determining the skewness of a distribution from a box plot

is to imagine what the histogram would look like.

The peak of the distribution will be roughly around the median, and

the tails will extend out along the whiskers in the box plot.

There's one more visualization method that we will discuss in this video:

an intensity map.

For certain types of data, like the ones we've been working with in this video,

it might be useful to view the spatial distribution.

These displays reveal trends in the data that many of the others did not.

For example, we can see that both income and

life expectancy are lower in Africa, but higher in North America and Europe.