So for data clustering,

we will be able to cluster P1 and P2 together, using P2 to represent both patterns.

So that means if we do this data clustering,

then all the patterns in the cluster can be represented by one pattern, P.
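
To make this concrete, here is a minimal sketch of letting one pattern stand in for a cluster. It assumes a pattern distance computed as the Jaccard distance between the sets of transactions that contain each pattern, and a hypothetical threshold `delta`. The function names and the toy data are made up for illustration, and the sketch ignores any further conditions a real compression method may impose on the representative.

```python
def pattern_distance(tids_a, tids_b):
    """Jaccard distance between the transaction-id sets of two patterns."""
    union = tids_a | tids_b
    if not union:
        return 0.0
    return 1.0 - len(tids_a & tids_b) / len(union)


def covered_by(representative, patterns, delta):
    """Names of all patterns within distance delta of the representative."""
    rep_tids = patterns[representative]
    return sorted(
        name
        for name, tids in patterns.items()
        if pattern_distance(rep_tids, tids) <= delta
    )


# Toy data (made up): pattern name -> ids of the transactions containing it.
patterns = {
    "P1": {1, 2, 3, 4, 5},
    "P2": {1, 2, 3, 4, 5, 6},
    "P3": {7, 8, 9},
}

# P1 and P2 share almost all their transactions, so P2 can represent both.
cluster = covered_by("P2", patterns, delta=0.2)
print(cluster)
```

With this toy data, P1 falls within the threshold of P2 while P3 does not, so P2 alone can represent the cluster containing P1 and P2.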

So the problem becomes whether we should mine all the patterns and then compress them,

or directly mine the compressed patterns.

Actually, there's an efficient method which can directly mine those compressed

patterns.

I'm not going to get into the detail but you may refer to this interesting paper.

Okay, then another interesting idea is redundancy-aware top-k patterns.

That means we want to get a desired set of patterns, similar to the compressed

ones, because we want high significance and low redundancy

in this kind of pattern set, okay.

Let's look at these four panels, a, b, c, and d, showing four different kinds of compression.

Actually a is a set of original patterns.

Their closeness shows their pattern distance, and the color shows significance:

the darker is more significant, the lighter is less significant.

Okay, in that case you probably can see in this bigger cluster,

there are three patterns.

They are quite significant.

If you just do top-k pattern mining, that is, you rank by support count

or some other significance measure, you would only find these three patterns.

Suppose we want to find only the top three; then all the remaining

patterns, like those here in the other cluster, are completely missing.
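
A small sketch shows how this happens. The pattern names, cluster labels, and significance scores below are made up for illustration; the only thing being demonstrated is that ranking purely by significance can return patterns from a single cluster.

```python
# Toy data (made up): pattern name -> (cluster label, significance score).
patterns = {
    "A1": ("big", 0.95),
    "A2": ("big", 0.90),
    "A3": ("big", 0.85),
    "B1": ("small", 0.70),
    "B2": ("small", 0.40),
}


def top_k_by_significance(patterns, k):
    """Rank patterns by significance alone and keep the first k."""
    ranked = sorted(patterns, key=lambda name: patterns[name][1], reverse=True)
    return ranked[:k]


top3 = top_k_by_significance(patterns, 3)
clusters_shown = {patterns[name][0] for name in top3}
print(top3, clusters_shown)
```

Here all three selected patterns come from the bigger cluster, so nothing from the other cluster is shown.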

But suppose you say, I will just do summarization: try to find the clusters,

and within each cluster try to find its center.

Then you may well end up with those less significant patterns, so

this may not be a good balance either.

Actually, a better balance is to take care of both significance and redundancy.

Simply put, look at this one: there are some very significant patterns,

and since they are also at the cluster center,

you may want to show these patterns.

In the meantime, suppose you can only show three;

you may show these, which are significant and less redundant.

This one is significant and it also represents this cluster.

So the problem becomes how to develop an efficient and

effective method for finding such redundancy-aware top-k patterns.

There's an interesting study which uses maximal marginal

significance to measure the combined significance of a pattern set and

develops efficient methods to mine such patterns.
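
To give a feel for the trade-off, here is a simplified greedy sketch. It is not the method from the study: the score used here (a pattern's significance minus its highest redundancy with the patterns already chosen), the redundancy formula, the similarity values, and the toy data are all assumptions made for illustration.

```python
def greedy_redundancy_aware_top_k(significance, similarity, k):
    """Greedily pick k patterns, each time taking the pattern whose
    significance minus redundancy with the chosen set is largest."""
    selected = []
    candidates = set(significance)
    while candidates and len(selected) < k:

        def marginal_gain(p):
            # Assumed redundancy: similarity scaled by the smaller significance.
            redundancy = max(
                (
                    similarity(p, q) * min(significance[p], significance[q])
                    for q in selected
                ),
                default=0.0,
            )
            return significance[p] - redundancy

        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(candidates), key=marginal_gain)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): significance scores and cluster membership.
significance = {"A1": 0.95, "A2": 0.90, "A3": 0.85, "B1": 0.70, "B2": 0.40}
cluster_of = {"A1": "big", "A2": "big", "A3": "big", "B1": "small", "B2": "small"}


def similarity(p, q):
    """Assumed similarity: high within a cluster, low across clusters."""
    return 0.9 if cluster_of[p] == cluster_of[q] else 0.1


picked = greedy_redundancy_aware_top_k(significance, similarity, k=3)
print(picked)
```

On this toy data the selection keeps the most significant pattern of the bigger cluster, then picks a pattern from the other cluster before returning to the bigger one, so both clusters are represented among the three.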

We are not going to get into the details of this method.

Interested readers may read the paper we pointed out.

Thank you.
