0:24

And in these casinos, the favorite game was Cho-han,

when the dealer would toss two dice,

and the visitors would bet on the outcomes,

the outcomes being the sum of numbers on these two dice.

So "cho" means, as you probably guessed, "even", and "han" means "odd".

This is, of course, a fair game.

If the dice are not loaded, then there is 50/50 probability of winning.

However, in Bakuto casinos,

the dealers often used biased dice, "loaded" dice.

And in this case, the outcome may be shifted towards the

casino owner or towards insider backers.

Also, it may be fun ro play Cho-han, in a Yakuza run casino.

We will play an equivalent game,

where the dealer simply flips a coin, and we bet on heads or tails.

The challenge here

is that the dealer actually has a choice of two coins: fair and biased.

For the fair coin, the probability of heads is 1/2 of course, and for

the biased coin, the probability of heads is 3/4.

On this slide, I show the coins in blue and

green, but in reality, the coins are completely identical, so

you have no clue which coin the dealer used.

The question arises: Suppose the dealer made 100 flips and

63 of them are heads;

which coin did the dealer use, fair or biased?

2:13

This question, of course, doesn't make sense

because any coin, fair or biased, can result in 63 heads.

The right question to ask is: What coin is more likely

if 63 out of 100 flips resulted in heads?

How do we answer this question?

Here's a hint: 63 is a little bit closer to 75 than 50.

Should we then assume that the biased coin is more likely in this series of flips?

2:47

Before we come to a conclusion, let's try to estimate some probabilities.

Let's consider a sequence of n flips, denoted X = x1 x2 ... xn,

with k heads.

The probability that this sequence was generated by the fair coin

is of course (1/2)^n.

3:09

But the probability that it was generated by the biased coin

is (3/4)^k * (1/4)^(n-k)

How can we decide what coin is more likely?

Of course, if the probability of X being generated by the fair coin

is larger than the probability of X being generated by the biased coin,

then the fair coin is more likely.

Let's refer to these probabilities as the blue probability and

green probability, respectively,

when we talk about the probability of X being generated by the fair coin or biased coin.

And likewise, if the blue probability is smaller than the green probability,

then the biased coin is more likely.

However, when these probabilities are the same,

when in equilibrium state, when blue probability = green probability,

we can compute, by doing some arithmetic, that in this case,

k = log_2(3) * n

Or, in other words, k will be approximately equal to 0.632 multiplied by n.

Which means that when the number of heads,

or the fraction of the number of heads is smaller than 0.632,

it means that the dealer is more likely to have used the fair coin.

Therefore, our original intuition actually was incorrect.

If there are 63 heads in the series of 100 coin tosses,

then the dealer was more likely to use the fair coin,

even though 63 is closer to 75 than to 50.

The dealers in traditional Yakuza-run casinos were

shirtless to avoid accusation of tampering the dice

because a shirtless dealer is less likely to switch the fair dice into the biased dice.

But the reality was that even shirtless dealers were able to switch the dice,

and we will model these situations with two coins up the dealer's sleeve,

and instead of the previous case, where the dealer ran a serious of

5:36

coin flips, but always used either the fair or the biased coin,

we will assume that the dealer now can change the coin at any moment,

with probability 0.1.

So, after watching a sequence of flips,

can you tell when the dealer was the fair coin and

when he was using the biased coin?

Our attempt to read the dealer's mind brings us to the following

casino problem: Given a sequence of coin flips, determine

when the dealer was using the fair coin and when he was using the biased coin.

Do you think it is a well-formulated computational problem?

6:20

Of course it is not

because any outcome of coin tosses

could have been generated by any combination of fair and biased coins.

And therefore, we need to grade different scenarios, for example BBBBB, FFFFF,

6:43

But if we try to do this, then the question arises:

How can we explore and grade

2^n possible scenarios for coin tosses?

Here's a simple approach to reading the dealer's mind one window at a time.

Let's choose a small window, let's say a window of five tosses,

and for each such window,

let's decide which coin is more likely to generate the outcome in this window.

We already know how to answer this question.

For example, for this window, HHHTH, there are 80% heads,

therefore it's most likely to have been generated by the biased coin.

For the second window, we have only 60% of heads,

and therefore, this is more likely to be generated by the fair coin.

We will actually use a ratio of blue and green probabilities,

and we will decide what coin generated the given outcome

by simply computing this ratio.

If this ratio is smaller than 1, then the biased coin is more likely.

If this ratio is higher than 1, then the fair coin is more likely.

And we will also introduce log-odds ratio,

which is simply the base-two logarithm of this ratio, and in this case,

as we know, the base-two logarithm of the ratio equals

(number of tosses) - log_2(3) * (number heads)

and our decision rule for deciding whether a given window was generated by the

biased or fair coin will simply be: If the log-odds ratio is less than zero,

then it is a biased coin.

If it is larger than zero, then it is a fair coin, and

it is represented by this simple rule.

8:39

So, to continue reading the dealer's mind, we will go through the whole sequence,

and every time computing which coin is more likely, and then we will classify

all these windows into being more likely to be generated by the biased or the fair coin.

8:57

The question however, arises: What are the disadvantages of this approach?

The obvious disadvantage of the sliding window approach is that different windows

may classify the same flip differently.

Also, the choice of the window length always affects our conclusions.

But how do we know how to choose the window length?

9:21

And all that may be an interesting introduction to makeshift casinos and

coin flipping,

but what does it have to do with biology?

To explain what coin flipping has to do with biological applications, we will start

from a simple example of CG islands and then move to more complex examples.

10:06

content 46% while platypus has GC content 58%,

which means that, on average,

if the distribution was uniform, you would expect that both nucleotides G and C

appear in the human genome with a probability of 23%.

And therefore, you would expect that each of the dinucleotides

(CC, CG, GC, and GG) appears in the genome with frequency 5.29%.

But the reality is that the frequency of CG in the human genome is 5 times smaller.

Why?

Methylation is a DNA modification that adds methyl CH3 group

to the cytosine nucleotide, and

it often happens to a C in a CG dinucleotide.

11:01

The resulting methylated cytosine has a tendency to deaminate into

thymine, and as a result of methylation,

CG is the least frequent dinucleotide in many genomes.

However, methylation is often suppressed in the areas around

genes called CG-islands, where CG appears frequently.

For this example,

this would be a CG-island.

And the question arises: Suppose you want to find genes.

How would you search for CG-islands as a prerequisite for finding genes?

Well, if we were to use the same paradigm of Log-odds ratio,

and simply classify CG-islands as more likely in areas where this

Log-odds ratio is less than 0, and non-CG island as

more likely in the areas where Log-odds ratio is larger than 0,

then we will arrive at a classification algorithm for CG-islands.

However, there are a few issues.

As before,

different windows may classify the same position in the genome differently.

It is not clear how to choose the length of the window to construct CG-islands,

and very importantly it is not clear whether we should use the same

length of windows for different regions of the genome.

And in the next section, I will describe the notion of

Hidden Markov Models that provides a new, better paradigm, for

problems like finding CG Islands and many, many other problems in bioinformatics.