Corollary 8.19 is a subtle but very useful result. It says that if R_I(0) is strictly greater than zero, then R_I(D) is strictly decreasing for D between zero and D_max, and the inequality constraint in the definition of R_I(D) can be replaced by an equality constraint. To prove the corollary, we first show that R_I(D) is strictly greater than zero for D greater than or equal to zero and strictly less than D_max. This is done by contradiction, in a few sub-steps. Suppose R_I(D') is equal to zero for some D' greater than or equal to zero and strictly less than D_max, and let R_I(D') be achieved by some X hat. Then R_I(D') is equal to I(X; X hat), which is equal to zero, which means that X and X hat are independent. We then show that such an X hat, which is independent of X, cannot do better than the constant estimate x hat star. That is, the expectation of d(X, X hat) is greater than or equal to the expectation of d(X, x hat star), which is equal to D_max. This is seen as follows. First, D' is greater than or equal to the expectation of d(X, X hat), which is equal to summation over x, summation over x hat, of the probability of x and x hat times d(x, x hat). Since X and X hat are independent, p(x, x hat) can be written as p(x) times p(x hat). Now, p(x hat) does not depend on x, and so it can be moved outside the summation over x, and summation over x of p(x) d(x, x hat) is simply the expectation of d(X, x hat). This is greater than or equal to the expectation of d(X, x hat star) by the definition of x hat star, and this expectation is equal to D_max. Thus we have shown that D' is greater than or equal to D_max. This is a contradiction, because at the very beginning we assumed that D' is strictly less than D_max. Therefore we conclude that R_I(D) is strictly greater than zero for D greater than or equal to zero and strictly less than D_max.
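The key inequality in this first step, that an estimate independent of X cannot beat the best constant estimate, can be checked numerically. Below is a minimal sketch in Python for a hypothetical binary source with Hamming distortion; the numbers 0.7, 0.3 and the distribution chosen for X hat are arbitrary values for illustration only.

```python
# Hypothetical binary source: P(X = 0) = 0.7, P(X = 1) = 0.3,
# with the Hamming distortion measure d(x, xhat) = 1{x != xhat}.
p_x = {0: 0.7, 1: 0.3}
d = lambda x, xhat: 0 if x == xhat else 1

# D_max is the expected distortion of the best constant estimate x hat star.
D_max = min(sum(p_x[x] * d(x, c) for x in p_x) for c in (0, 1))

# For any X hat independent of X, E[d(X, X hat)] is a convex combination
# over xhat of E[d(X, xhat)], and each of those terms is at least D_max.
p_xhat = {0: 0.4, 1: 0.6}   # an arbitrary distribution, independent of X
E_d = sum(p_x[x] * p_xhat[xh] * d(x, xh) for x in p_x for xh in p_xhat)

assert E_d >= D_max   # independence cannot beat the constant guess
```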
In the second step, we note that R_I(D) must be strictly decreasing for D greater than or equal to zero and less than or equal to D_max, because R_I(0) is greater than zero, R_I(D_max) is equal to zero, and R_I(D) is non-increasing and convex. The idea is illustrated in the following figure. Suppose R_I(D) decreases for a while, stays constant, and then decreases to zero at D equals D_max. Then R_I(D) is not convex, which is a contradiction. In other words, R_I(D) cannot stay constant for a while, and so R_I(D) must be strictly decreasing for D between zero and D_max. In the third step, we show by contradiction that the inequality constraint in the definition of R_I(D) can be replaced by an equality constraint. Again, this is achieved in a few sub-steps. First, we assume that R_I(D) is achieved by some X hat star such that the expectation of d(X, X hat star) is equal to some D'', which is strictly less than D. Then R_I(D'') by definition is equal to the minimum of I(X; X hat) over all X hat such that the expectation of d(X, X hat) is less than or equal to D''. This is less than or equal to I(X; X hat star), because the expectation of the distortion between X and X hat star is equal to D'', and by our assumption this mutual information is equal to R_I(D). Thus we have shown that while D'' is less than D, R_I(D'') is less than or equal to R_I(D), which is a contradiction, because R_I(D) is strictly decreasing for D between zero and D_max. Therefore, we conclude that the expectation of d(X, X hat star) is exactly equal to D, instead of strictly less than D. Thus, the inequality constraint in the definition of R_I(D) can be replaced by an equality constraint. This finishes the proof of Corollary 8.19. In this corollary, the assumption is that R_I(0) is strictly greater than zero. This appears to be a rather restrictive assumption, but in fact it is not.
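The convexity contradiction in the second step can be made concrete with a toy curve (purely hypothetical, not the R_I of any particular source) that decreases, stays flat, and then drops to zero: any three points straddling the flat stretch violate the convexity inequality.

```python
# Toy curve on [0, 0.6]: decreasing on [0, 0.2], flat on [0.2, 0.4],
# then dropping linearly to zero at D_max = 0.6.
def R(D):
    if D <= 0.2:
        return 1.0 - 2.0 * D            # R(0) = 1 > 0
    if D <= 0.4:
        return 0.6                      # the flat stretch
    return 0.6 * (0.6 - D) / 0.2        # reaches R(0.6) = 0

# Convexity would require R(D2) <= t * R(D1) + (1 - t) * R(D3)
# for any D1 < D2 < D3, with t = (D3 - D2) / (D3 - D1).
D1, D2, D3 = 0.2, 0.4, 0.6
t = (D3 - D2) / (D3 - D1)
chord = t * R(D1) + (1 - t) * R(D3)

assert R(D2) > chord   # the flat stretch breaks convexity
```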
The reason is that in all problems of interest, R(0), which is equal to R_I(0), is greater than zero, because otherwise R(D) would be equal to zero for all D greater than or equal to zero, since R(D) is non-negative and non-increasing. Therefore, we have R_I(D) equals the minimum of I(X; X hat) over all X hat such that the expectation of d(X, X hat) is equal to D. In Example 8.20, we look at a binary source. Let X be a binary random variable with probability that X equals zero equal to one minus gamma, and probability that X equals one equal to gamma. Let the reproduction alphabet be the same as the source alphabet, that is {0,1}, and let d be the Hamming distortion measure. In the following, we will determine R_I(D), the information rate distortion function. First, we consider gamma between zero and one half. We will show that R_I(D) is equal to h_b(gamma) minus h_b(D) if D is between zero and gamma, and equals zero if D is greater than or equal to gamma. Since gamma is less than or equal to one half, that is, it is more likely that X is equal to zero, we see that x hat star, the best possible constant guess, is equal to zero. Then D_max is equal to the expectation of d of X and this constant guess zero, which is equal to the probability that X is equal to one, which is equal to gamma. Now, consider any estimate X hat and let Y equal the distortion between X and X hat. As d is the Hamming distortion measure, Y is simply the indicator function of the error event. Now, observe that conditioning on X hat, X and Y determine each other, because conditioning on X hat, if we know X then we know whether there is an error; on the other hand, conditioning on X hat, if we know whether there is an error, we can also determine the value of X. And so the conditional entropy of X given X hat is equal to the conditional entropy of Y given X hat.
Then for any X hat such that the expectation of d(X, X hat) is less than or equal to D, where D is less than gamma, which equals D_max, we have the mutual information between X and X hat equal to the entropy of X minus the entropy of X given X hat, where the entropy of X is equal to h_b(gamma), and the entropy of X given X hat is equal to the entropy of Y given X hat. Now the entropy of Y given X hat is upper bounded by the entropy of Y, so with the negative sign, we obtain the lower bound h_b(gamma) minus the entropy of Y. As Y is the indicator function of the error event, the entropy of Y is equal to h_b of the probability that X is not equal to X hat. We can further obtain the lower bound h_b(gamma) minus h_b(D), because the probability that X is not equal to X hat, that is, the probability of error, is equal to the expectation of the distortion between X and X hat, which is less than or equal to D, and h_b(a) is increasing for a between zero and one half. Thus, we have shown that for any X hat such that the expectation of d(X, X hat) is less than or equal to D, the mutual information between X and X hat is greater than or equal to h_b(gamma) minus h_b(D). Therefore, R_I(D), which is the minimum of the mutual information between X and X hat over all such X hat, is greater than or equal to h_b(gamma) minus h_b(D). To prove that this lower bound is indeed tight, we need to prove that the inequalities in (1) and (2) can be tight simultaneously. Observe that the inequality in (1) is tight if and only if H(Y|X hat) is equal to H(Y), that is, Y is independent of X hat. And the inequality in (2) is tight if and only if the probability that X is not equal to X hat is equal to D. Now recall that in order to specify the random variable X hat, we can specify the joint distribution of X and X hat such that the marginal distribution of X is exactly equal to the given source distribution for X.
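The chain of bounds above can be verified by brute force: fix gamma, draw random test channels from X to X hat, and check that whenever the error probability D is below gamma, the mutual information is at least h_b(gamma) minus h_b(D). This is only a numerical sanity check in Python, not part of the proof; the value gamma = 0.3 and the random channels are our own illustrative choices.

```python
import math, random

def h_b(p):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

gamma = 0.3
p_x = [1 - gamma, gamma]
random.seed(1)

for _ in range(1000):
    q0, q1 = random.random(), random.random()   # random channel p(xhat | x)
    joint = [[p_x[0]*(1 - q0), p_x[0]*q0],      # joint[x][xhat]
             [p_x[1]*q1, p_x[1]*(1 - q1)]]
    D = joint[0][1] + joint[1][0]               # P(X != X hat)
    if D >= gamma:
        continue                                 # bound is trivial past D_max
    p_xhat = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    I = sum(joint[x][xh] * math.log2(joint[x][xh] / (p_x[x] * p_xhat[xh]))
            for x in (0, 1) for xh in (0, 1) if joint[x][xh] > 0)
    assert I >= h_b(gamma) - h_b(D) - 1e-9      # I(X; X hat) >= h_b(gamma) - h_b(D)
```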
Thus we are motivated to start with the random variable X hat and pass it through a certain channel to obtain the random variable X. In order for this to work, we need to specify the distribution for X hat and a channel such that the output distribution is exactly equal to the source distribution for X, that is, the binary distribution consisting of the probability masses one minus gamma and gamma. Toward this end, we notice that the binary symmetric channel with crossover probability equal to D is exactly what we need, because for the BSC, the probability of the error event is independent of the input variable X hat, that is, Y, the indicator function of the error event, is independent of X hat. Also, the probability that X is not equal to X hat is simply equal to D, the crossover probability. This binary symmetric channel is called a reverse binary symmetric channel. So it remains to specify a distribution for X hat such that the output distribution is exactly the given source distribution. We claim that the distribution for X hat is given by one minus gamma minus D divided by one minus two D, and gamma minus D divided by one minus two D. Let us verify that this is correct. We note that one minus gamma minus D divided by one minus two D, times D, plus gamma minus D divided by one minus two D, times one minus D, is exactly equal to gamma. We also need to verify that the input distribution is a proper distribution, namely that gamma minus D divided by one minus two D is between zero and one. This follows because D is less than gamma, which is less than or equal to one half. The details are given in the textbook. So, we see that the lower bound that we have obtained for R_I(D) is indeed tight. Therefore, we conclude that for gamma between zero and one half, R_I(D) is equal to h_b(gamma) minus h_b(D) if D is between zero and gamma, and is equal to zero if D is greater than or equal to gamma.
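The reverse-BSC construction can be verified numerically. The sketch below (with arbitrary illustrative values gamma = 0.3 and D = 0.1) checks that passing the claimed input distribution through a BSC with crossover probability D reproduces the source distribution, gives error probability exactly D, and attains the mutual information h_b(gamma) minus h_b(D).

```python
import math

def h_b(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

gamma, D = 0.3, 0.1                      # requires 0 < D < gamma <= 1/2
a = (1 - gamma - D) / (1 - 2*D)          # claimed P(X hat = 0)
b = (gamma - D) / (1 - 2*D)              # claimed P(X hat = 1)
assert 0 <= b <= 1 and abs(a + b - 1) < 1e-12   # a proper distribution

# Pass X hat through the BSC with crossover probability D to get X.
p_xhat = {0: a, 1: b}
joint = {(xh, x): p_xhat[xh] * ((1 - D) if x == xh else D)
         for xh in (0, 1) for x in (0, 1)}

p_x = {x: joint[(0, x)] + joint[(1, x)] for x in (0, 1)}
assert abs(p_x[1] - gamma) < 1e-12       # output matches the source distribution

err = sum(v for (xh, x), v in joint.items() if x != xh)
assert abs(err - D) < 1e-12              # P(X != X hat) = D exactly

I = sum(v * math.log2(v / (p_xhat[xh] * p_x[x]))
        for (xh, x), v in joint.items() if v > 0)
assert abs(I - (h_b(gamma) - h_b(D))) < 1e-9   # the lower bound is achieved
```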
For the other case, when gamma is between one half and one, by exchanging the roles of the symbols zero and one and applying the same argument, we obtain R_I(D) as above except that gamma is replaced by one minus gamma, because in this case one minus gamma is between zero and one half. That is, R_I(D) is equal to h_b(1-gamma) minus h_b(D) if D is between zero and one minus gamma, and is equal to zero if D is greater than or equal to one minus gamma. Instead of writing out the expression for R_I(D) for two different cases, we can actually combine the two cases and obtain R_I(D) equals h_b(gamma) minus h_b(D) if D is between zero and the minimum of gamma and one minus gamma, and zero if D is greater than or equal to the minimum of gamma and one minus gamma, for all gamma between zero and one. To verify that this is correct, we first check the case for gamma between zero and one half. For this case, the minimum of gamma and one minus gamma is simply equal to gamma, and this is exactly what we obtained in step 7. For gamma between one half and one, the minimum of gamma and one minus gamma is one minus gamma, which is exactly the same as the bound we obtained in step 8. In addition, we notice that h_b(gamma) is simply h_b(1-gamma), and so we have verified the case for gamma between one half and one. Now for the uniform binary source, that is, the case when gamma is equal to one half, with the Hamming distortion measure, the information rate distortion function is equal to one minus h_b(D) if D is between zero and one half, and is equal to zero if D is greater than or equal to one half. This is shown in the following figure. Here is a subtle but important remark about the rate distortion theorem. The rate distortion theorem does not include the source coding theorem as a special case. In Example 8.20, we have obtained this expression for the information rate distortion function R_I(D), where gamma is equal to the probability that X is equal to one.
Therefore, when D is equal to zero, we have R_I(D) equals h_b(gamma), because h_b(D) is equal to zero, which is equal to the entropy of X, that is, the source entropy. By the rate distortion theorem, if R is greater than R_I(0), that is, the entropy of X, the average Hamming distortion, that is, the error probability per symbol, can be made arbitrarily small. However, by the source coding theorem, if R is greater than the entropy of X, the message error probability can be made arbitrarily small, which is much stronger. The reason is that in decoding the message, if one symbol is decoded incorrectly, then the whole message is decoded incorrectly. In other words, the error probability per symbol being small does not imply that the message error probability is small. On the other hand, if the message error probability is small, then the error probability per symbol must be small.
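The combined expression from Example 8.20, together with the D = 0 endpoint just discussed, can be collected into a short script. This is a sketch for illustration; the helper name R_I is our own.

```python
import math

def h_b(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def R_I(D, gamma):
    """Information rate-distortion function of a binary source with
    P(X = 1) = gamma under the Hamming distortion measure."""
    D_max = min(gamma, 1 - gamma)
    return h_b(gamma) - h_b(D) if 0 <= D < D_max else 0.0

# gamma and 1 - gamma give the same curve, since h_b is symmetric.
assert abs(R_I(0.1, 0.3) - R_I(0.1, 0.7)) < 1e-12
# Uniform source (gamma = 1/2): R_I(D) = 1 - h_b(D) for D < 1/2.
assert abs(R_I(0.25, 0.5) - (1 - h_b(0.25))) < 1e-12
# Beyond D_max the function is zero.
assert R_I(0.4, 0.3) == 0.0
# At D = 0 the rate equals the source entropy h_b(gamma).
assert abs(R_I(0.0, 0.3) - h_b(0.3)) < 1e-12
```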