Data repositories in which cases are related to subcases are called hierarchical. This course covers representation schemes for hierarchies and algorithms for analyzing hierarchical data, and provides opportunities to apply several analysis methods.

Associate Professor in the School of Computing, Informatics & Decision Systems Engineering at Arizona State University and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In this module, we want to discuss some simple methods for detecting changes,

anomalous changes, in time series.

Oftentimes, we are faced with such a large amount of

data that we don't have time to visualize all of it,

look through it, sift it, and try to hypothesize about things.

We want to use statistical methods to help us identify potential places in the data

that may be problematic or may be pointing

to trends or values that we should be looking at.

And one of the most common ways of doing this with

time series data is through Control Chart Analysis.

And in this module, we're going to talk specifically about

doing Control Chart analysis of the data.

So in other modules in this unit,

we're going to talk about data mining techniques for examining time series,

looking for patterns and anomalies.

And the idea of this whole course is really to think

about how we combine statistics and data mining

with visualizations, to enhance the visualizations and show what's important.

And if we use this in exploratory analysis,

we can help people think: I might think this trend looks interesting,

show me similar trends

or I don't know what trends are interesting and I have so much data,

show me things that are anomalous,

help me find key patterns,

help me find anomalies.

And so, Typical Time Series Analysis includes Trend Analysis.

So trying to figure out a company's linear growth in sales over the years,

or looking for seasonality.

So, a company's sales may look something like this.

And the linear growth,

they may have some sort of underlying linear growth,

but they have seasonality too.

And then with forecasting, you might want to think about,

well, how much sales do we expect next quarter?

And if we don't meet our target,

what happened and why?

And start reasoning with data for decision making.

And so, one of the most common tools for

exploring patterns and anomalies in time series data is a Control Chart.

And so for temporal data,

we can find what we call statistical anomalies through control charts.

And control charts just consist of a statistic representing some measurement in time.

And what we do, is we calculate

the mean and standard deviation given all the available samples and basically,

if the current value is greater than some pre-set number of standard deviations,

then we generate an alert.
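That rule can be written down in just a few lines. A minimal sketch, assuming hypothetical daily patient counts and the conventional three-sigma threshold:

```python
from statistics import mean, stdev

def control_chart_alert(history, current, n_sigma=3):
    """Flag `current` if it lies more than `n_sigma` standard
    deviations away from the mean of the historical samples."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(current - mu) > n_sigma * sigma

# Hypothetical daily patient counts at a hospital:
past = [50, 52, 48, 51, 49, 50, 53]
print(control_chart_alert(past, 55))   # False: within bounds
print(control_chart_alert(past, 90))   # True: well outside, generate an alert
```

The number of standard deviations (`n_sigma`) is the pre-set threshold the lecture mentions; two and three are common choices.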

So what do I mean by this?

Well, let's imagine we have data at a hospital where this axis is,

let's say, the day, and this is the number of patients that come in every day.

And so, we go along and then all of a sudden we start seeing this.

What happened?

Why was there such a big upswing?

And how can we capture that this occurred without

having to have somebody visualize this or look at this?

Of course, we can set rules in the database where we say, well,

if there are three times as many people coming in today as yesterday,

somebody better look at this.

But if there are just small changes,

small blips, or values that change very slowly over time,

how do we capture this?

How do we think about this?

And Control Charts allow us to sort of set up statistics to do this.

So, a control chart is essentially a graph used to study how a process changes over time,

whether it's the number of patients coming to a hospital,

quality of parts coming out of a manufacturing facility,

with the data plotted in time order.

And what we do is for each day,

we use the preceding X number of days and X can be seven,

30, whatever, it depends on the study.

We use the X number of days,

we calculate the average value of those X number of days,

the standard deviation and then we have a line

for showing when we're two or three standard deviations above the value.

So, here we have the mean and we're showing

how many standard deviations above or below the mean we are for a given day.

So, everyday the mean is recalculated,

with the mean as our baseline,

and then we show how many standard deviations above or below we are.

So, we have a central line for the average and upper line

for the upper control limit and a lower line for the lower control limit.

And these lines are determined from our historical data.
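The recalculated baseline and control limits described above can be sketched directly. This assumes a hypothetical window of seven days and two-sigma limits (the lecture mentions two or three):

```python
from statistics import mean, stdev

def rolling_control_limits(series, window, n_sigma=2):
    """For each point after the first `window` values, recompute the
    baseline mean and standard deviation from the preceding window,
    and report (value, lower limit, upper limit, out_of_control)."""
    results = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        lo, hi = mu - n_sigma * sigma, mu + n_sigma * sigma
        results.append((series[i], lo, hi, not (lo <= series[i] <= hi)))
    return results

# Steady counts, then a sudden upswing on the last day:
data = [20, 22, 21, 19, 20, 21, 22, 20, 40]
for value, lo, hi, out in rolling_control_limits(data, window=7):
    print(value, round(lo, 1), round(hi, 1), out)
```

Here the upswing to 40 falls outside the upper control limit computed from the preceding seven days, so it gets flagged.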

So we use a control chart to control

an ongoing process by finding and correcting problems as they occur.

So we're using this in real time to say, "Hey,

are we suddenly getting outside of our bounds of what's normal?"

We can also use these to predict the expected range of outcomes for a process.

We use this to determine if a process is stable and not really going outside the bounds,

and to analyze patterns of process variation from special causes or common causes.

And also, to determine whether the quality improvement projects should aim

to prevent specific problems or make fundamental changes to the process.

So in the class project,

we talked about this amusement park, theme park, sort of scenario.

And with the theme park,

we could think about the number of riders riding on a ride every five minutes.

Every half hour? Every hour?

So if we have a particular roller coaster,

we want to know how well it's doing or if the line is jamming up,

we can start using process control.

We can use this for a restaurant:

how many people were able to go through the restaurant every hour?

Because we expect it to be within some sort of controlled range.

And a Control Chart

is really just as simple as calculating

the mean and standard deviation from historical data.

And the window of historical data that we're looking

at is what's going to determine the control limits.

The most common is a Moving Average Chart,

where we're going to monitor the process location over time.

This is generally used for detecting small shifts in the process mean.

So, what I mean by moving average is,

here's our data and here's our newest value.

So our moving average is going to use

the last X chunk of data, so it's going to have a window.

We get the new value,

we again take the last X chunk of data,

and so forth as we go through,

and the control limits are derived from the average range, as on a Range Chart.

So, for example, let's say we're looking at

a stock market set of data and we have daily closing stock market value.

So it went: 11,

12, 13, 14, 15,

16, and we want to do a five-day moving average.

So for the first four days,

we can't calculate anything;

we don't have enough historical data.

So that also becomes a problem,

as we need enough historical data to find a baseline.

So, once we have five days of data,

we can calculate the moving average: 13,

then 14, and so on as the days go on.
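That five-day moving average on the stock example can be computed in a couple of lines:

```python
def moving_average(series, window):
    """Simple moving average: the mean of each trailing window.
    The first `window - 1` days have no value yet."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

closes = [11, 12, 13, 14, 15, 16]      # daily closing values from the example
print(moving_average(closes, 5))       # [13.0, 14.0]
```

The average of days 1 through 5 (11 through 15) is 13, then the window slides to days 2 through 6 and gives 14.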

Now, for an Exponentially Weighted Moving Average,

instead of taking the simple average like we showed there,

we're going to have some sort of multiplier, and the EMA is

today's value times the multiplier,

plus the previous day's EMA times one minus the multiplier.

So it tries to adjust and capture quick-moving trends as well.
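A sketch of that exponentially weighted update. The lecture doesn't pin down the multiplier, so the common convention alpha = 2 / (span + 1) is assumed here:

```python
def ema(series, span):
    """Exponentially weighted moving average, seeded with the first
    value, using the common multiplier alpha = 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    out = [series[0]]
    for x in series[1:]:
        # today's EMA = alpha * today's value + (1 - alpha) * yesterday's EMA
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

values = [11, 12, 13, 14, 15, 16]
print([round(v, 2) for v in ema(values, span=5)])
```

Because recent values get more weight, the EMA reacts to upticks faster than the simple moving average over the same span.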

With all of these,

we have to take into account how long that historical window is.

So if we use a shorter moving average,

it's nice because it's nimble and quick to

change, but it may not capture enough data to

smooth out jumps that don't really need to be explored.

With a longer moving average, we get more lag:

it smooths more,

but it's slower to change.

Now, the difference between a Simple Moving Average and an Exponential Moving Average:

it's not that one is better than the other,

it's that they capture different elements.

So the length of your moving average depends on

your analytical goals and the choice of exponential versus simple,

again depends on your analytical goals too.

So a Simple Moving Average with a long time window

is good for tracking slow-moving historical trends and changes,

whereas an Exponential Moving Average might be good at

capturing quick upticks in those things.

But what we can do, is when we're doing these tests,

we can set up our Control Charts and find anomalies in all of our time series.

So imagine for our amusement park data,

we have 100 different rides.

So do we want to visualize 100 different time series?

Maybe not, but maybe we want to show which ones had anomalies within

the data by doing a control chart processing on

the data and then showing the elements that are important.
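A sketch of that idea, screening every series and surfacing only the anomalous ones. The ride counts are hypothetical, and the three-sigma rule is the same one used earlier:

```python
from statistics import mean, stdev

def find_anomalous_series(named_series, n_sigma=3):
    """Given a dict of name -> time series, return the names whose
    latest value falls outside mean +/- n_sigma * std of the earlier
    values, so only those series need to be visualized."""
    flagged = []
    for name, series in named_series.items():
        history, latest = series[:-1], series[-1]
        mu, sigma = mean(history), stdev(history)
        if abs(latest - mu) > n_sigma * sigma:
            flagged.append(name)
    return flagged

rides = {
    "coaster": [30, 31, 29, 30, 32, 31, 30, 31, 30, 65],   # line jammed up
    "carousel": [12, 11, 13, 12, 12, 11, 13, 12, 11, 12],  # normal
}
print(find_anomalous_series(rides))    # ['coaster']
```

Instead of plotting 100 ride charts, we'd plot only the handful this screen flags, which is exactly the analyze-first, show-what's-important idea.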

So, remember, going back to this idea of analyze first,

show what's important, filter, visualize, analyze again.

And so again, this is just another tool we can put in our bag of tricks for doing

data exploration which can lead to enhancing our data visualizations. Thank you.
