0:06

In general, we can use the empirical count

of events in the observed data to estimate the probabilities.

And a commonly used technique is called a maximum likelihood estimate,

where we simply normalize the observe accounts.

So if we do that, we can see, we can compute these probabilities as follows.

For estimating the probability that we see a water current in a segment,

we simply normalize the count of segments that contain this word.

So let's first take a look at the data here.

On the right side, you see a list of some, hypothesizes the data.

These are segments.

And in some segments you see both words occur, they are indicated as ones for

both columns.

In some other cases only one will occur, so only that column has one and

the other column has zero.

And in all, of course, in some other cases none of the words occur,

so they are both zeros.

And for estimating these probabilities, we simply need to collect the three counts.

1:20

So the three counts are first, the count of W1.

And that's the total number of segments that contain word W1.

It's just as the ones in the column of W1.

We can count how many ones we have seen there.

The segment count is for word 2, and we just count the ones in the second column.

And these will give us the total number of segments that contain W2.

The third count is when both words occur.

So this time, we're going to count the sentence where both columns have ones.

1:56

And then, so this would give us the total number of segments

where we have seen both W1 and W2.

Once we have these counts, we can just normalize these counts by N,

which is the total number of segments, and

this will give us the probabilities that we need to compute original information.

Now, there is a small problem, when we have zero counts sometimes.

And in this case, we don't want a zero probability because our data may be

a small sample and in general, we would believe that it's potentially possible for

a [INAUDIBLE] to avoid any context.

So, to address this problem, we can use a technique called smoothing.

And that's basically to add some small constant to these counts,

and so that we don't get the zero probability in any case.

Now, the best way to understand smoothing is imagine that we actually observed more

data than we actually have, because we'll pretend we observed some pseudo-segments.

I illustrated on the top, on the right side on the slide.

And these pseudo-segments would contribute additional counts

of these words so that no event will have zero probability.

Now, in particular we introduce the four pseudo-segments.

Each is weighted at one quarter.

And these represent the four different combinations of occurrences of this word.

So now each event, each combination will have

at least one count or at least a non-zero count from this pseudo-segment.

So, in the actual segments that we'll observe,

it's okay if we haven't observed all of the combinations.

So more specifically, you can see the 0.5 here after it comes from the two

ones in the two pseudo-segments, because each is weighted at one quarter.

We add them up, we get 0.5.

And similar to this, 0.05 comes from one single

pseudo-segment that indicates the two words occur together.

4:09

And of course in the denominator we add the total number of pseudo-segments that

we add, in this case, we added a four pseudo-segments.

Each is weighed at one quarter so the total of the sum is, after the one.

So, that's why in the denominator you'll see a one there.

4:36

Now, so to summarize, syntagmatic relation can generally

be discovered by measuring correlations between occurrences of two words.

We've introduced the three concepts from information theory.

Entropy, which measures the uncertainty of a random variable X.

Conditional entropy, which measures the entropy of X given we know Y.

And mutual information of X and Y, which matches the entropy reduction of X

due to knowing Y, or entropy reduction of Y due to knowing X.

They are the same.

So these three concepts are actually very useful for other applications as well.

That's why we spent some time to explain this in detail.

But in particular, they are also very useful for

discovering syntagmatic relations.

In particular, mutual information is a principal way for

discovering such a relation.

It allows us to have values computed on different pairs of

words that are comparable and so we can rank these pairs and

discover the strongest syntagmatic from a collection of documents.

Now, note that there is some relation between syntagmatic relation discovery and

[INAUDIBLE] relation discovery.

So we already discussed the possibility of using BM25 to achieve waiting for

terms in the context to potentially also suggest the candidates

that have syntagmatic relations with the candidate word.

But here, once we use mutual information to discover syntagmatic relations,

we can also represent the context with this mutual information as weights.

So this would give us another way to represent

the context of a word, like a cat.

And if we do the same for all the words, then we can cluster these words or

compare the similarity between these words based on their context similarity.

So this provides yet another way to do term weighting for

paradigmatic relation discovery.

And so to summarize this whole part about word association mining.

We introduce two basic associations, called a paradigmatic and

a syntagmatic relations.

These are fairly general, they apply to any items in any language, so

the units don't have to be words, they can be phrases or entities.

7:11

We introduced multiple statistical approaches for discovering them,

mainly showing that pure statistical approaches are visible,

are variable for discovering both kind of relations.

And they can be combined to perform joint analysis, as well.

These approaches can be applied to any text with no human effort,

mostly because they are based on counting of words, yet

they can actually discover interesting relations of words.

7:44

We can also use different ways with defining context and segment, and

this would lead us to some interesting variations of applications.

For example, the context can be very narrow like a few words, around a word, or

a sentence, or maybe paragraphs, as using differing contexts would

allows to discover different flavors of paradigmatical relations.

And similarly, counting co-occurrences using let's say,

visual information to discover syntagmatical relations.

We also have to define the segment, and the segment can be defined as a narrow

text window or a longer text article.

And this would give us different kinds of associations.

These discovery associations can support many other applications,

in both information retrieval and text and data mining.

So here are some recommended readings, if you want to know more about the topic.

The first is a book with a chapter on collocations,

which is quite relevant to the topic of these lectures.

The second is an article about using various

statistical measures to discover lexical atoms.

Those are phrases that are non-compositional.

For example, hot dog is not really a dog that's hot,