In Section 9.2, we discuss two specializations of the general alternating optimization algorithm, called the Blahut–Arimoto algorithms. We first discuss the algorithm for computing the channel capacity.

Let r(x) p(y|x) be a given joint distribution on the product alphabet X × Y such that r is strictly positive, that is, r(x) > 0 for all x. This is illustrated in the diagram at the bottom: r(x) is the input distribution to a channel p(y|x), where X is the input random variable and Y is the output random variable. Let q be a transition matrix from Y to X, that is, a transition matrix in the reverse direction.

Now consider the maximization of the double summation, sum over x, sum over y, of r(x) p(y|x) log [q(x|y) / r(x)], where the maximization is taken over all q such that q(x|y) = 0 if and only if p(y|x) = 0. This double summation looks rather complicated, but note that all the quantities in blue are given.

Lemma 9.1 says that the maximum of this double summation is attained by the transition matrix q*, where q*(x|y) = r(x) p(y|x) / sum over x' of r(x') p(y|x'). The numerator here is the joint distribution of X and Y, and the denominator is simply the marginal distribution of Y, so q* is the conditional distribution of X given Y. That is, the maximizing q is the reverse transition matrix corresponding to the input distribution r and the transition matrix p(y|x).

Now, in the double summation on the right-hand side above, r(x) p(y|x) is simply q(x,y), and for the fraction inside the logarithm, the numerator q*(x|y) equals q(x,y)/q(y), while the denominator r(x) equals q(x), the marginal distribution of X. So this fraction equals q(x,y) / [q(x) q(y)].
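As an illustration (not part of the lecture), the maximizer in Lemma 9.1 can be computed directly. The sketch below assumes a binary symmetric channel with crossover probability 0.1 and a uniform input distribution, both chosen only for the example.

```python
import numpy as np

# Hypothetical example: binary symmetric channel with crossover probability 0.1
# and a strictly positive (uniform) input distribution r.
r = np.array([0.5, 0.5])          # r(x), input distribution
p = np.array([[0.9, 0.1],         # p(y|x), rows indexed by x, columns by y
              [0.1, 0.9]])

# Joint distribution r(x) p(y|x) and output marginal w(y) = sum_x' r(x') p(y|x')
joint = r[:, None] * p            # shape (|X|, |Y|)
w = joint.sum(axis=0)             # w(y)

# Lemma 9.1: the maximizing reverse transition matrix q*(x|y) = r(x) p(y|x) / w(y)
q_star = joint / w[None, :]

print(q_star)                     # each column is a distribution on X
print(q_star.sum(axis=0))         # columns sum to 1, as a transition matrix from Y to X should
```

Note that q_star is built from exactly the quantities the lemma calls given: r and p.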
And so the double summation is sum over x, sum over y, of q(x,y) log [q(x,y) / (q(x) q(y))], which is the mutual information between the input variable X and the output variable Y.

We now prove Lemma 9.1. In equation (2), let the denominator be w(y). Assume without loss of generality that for every y, p(y|x) > 0 for some x; otherwise, we can simply delete that y from the alphabet Y. Since the input distribution r is strictly positive, w(y) > 0 for all y, and hence q*(x|y) in equation (2) is well defined. Rearranging equation (2), we have r(x) p(y|x) = w(y) q*(x|y).

For any reverse transition matrix q satisfying the constraint in equation (1), consider the double summation evaluated at q* minus the double summation evaluated at q. The two r(x) terms inside the logarithms cancel with each other, and upon combining the two double summations, we obtain sum over x, sum over y, of r(x) p(y|x) log [q*(x|y) / q(x|y)]. Using the relation in step (4), we replace r(x) p(y|x) by w(y) q*(x|y). Now, w(y) does not depend on x, so it can be moved outside the summation over x. The inner summation, sum over x of q*(x|y) log [q*(x|y) / q(x|y)], is simply the divergence between q*(·|y) and q(·|y), and evidently this is greater than or equal to zero.

The proof is completed by noting that q* as defined in equation (2) indeed satisfies the constraint in equation (1): since w(y) > 0 and r(x) > 0, we have q*(x|y) = 0 if and only if p(y|x) = 0, which is the constraint in equation (1). Therefore, q* as defined in equation (2) is indeed the maximizing q. We remark that the maximizing q in Lemma 9.1 is in fact unique.
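The two facts established here can be checked numerically. The sketch below, with an input distribution and channel chosen arbitrarily for illustration, verifies that the double summation evaluated at q* equals the mutual information I(X;Y), and that any other admissible q gives a value no larger.

```python
import numpy as np

def double_sum(r, p, q):
    """F(r, q) = sum_{x,y} r(x) p(y|x) log( q(x|y) / r(x) )."""
    joint = r[:, None] * p
    mask = joint > 0                  # terms with p(y|x) = 0 contribute nothing
    ratio = q / r[:, None]
    return np.sum(joint[mask] * np.log(ratio[mask]))

def mutual_information(r, p):
    """I(X;Y) = sum_{x,y} q(x,y) log( q(x,y) / (q(x) q(y)) )."""
    joint = r[:, None] * p
    w = joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(r, w)[mask]))

# An assumed strictly positive input and channel, for illustration only.
r = np.array([0.3, 0.7])
p = np.array([[0.80, 0.20],
              [0.25, 0.75]])

joint = r[:, None] * p
q_star = joint / joint.sum(axis=0)    # the maximizer from Lemma 9.1

# F(r, q*) equals the mutual information I(X;Y).
assert np.isclose(double_sum(r, p, q_star), mutual_information(r, p))

# Any other admissible reverse transition matrix gives a value no larger.
rng = np.random.default_rng(0)
for _ in range(100):
    m = rng.random(p.shape)           # random positive entries
    q = m / m.sum(axis=0)             # normalize columns: q(x|y)
    assert double_sum(r, p, q) <= double_sum(r, p, q_star) + 1e-12
```

The gap between the two sides of the last assertion is exactly the sum over y of w(y) times the divergence between q*(·|y) and q(·|y), which is the quantity bounded in the proof.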
Toward this end, we note that in order for the inequality in step (5) to be tight, q(x|y) must equal q*(x|y) for all x and y, which implies that q must equal q*.

Theorem 9.2 says that for a discrete memoryless channel p(y|x), the channel capacity is equal to the supremum over all strictly positive input distributions r of the maximum over all q of sum over x, sum over y, of r(x) p(y|x) log [q(x|y) / r(x)], where the maximization is taken over all q satisfying equation (1) in Lemma 9.1. Note that by Lemma 9.1, this inner maximization is equal to the mutual information between the input and output of the channel. Comparing with the original definition of the channel capacity, we see that we have replaced the maximum over all input distributions r by the supremum over all strictly positive input distributions r.

We now prove Theorem 9.2. Since the mutual information between X and Y depends on the input distribution r and the transition matrix p of the generic channel p(y|x), we write I(X;Y) as I(r,p). Then C is equal to the maximum of I(r,p) over all input distributions r. By Lemma 9.1, the maximum of the double summation over all q in the formula in Theorem 9.2 is indeed equal to I(r,p). Therefore, we need to prove that the maximum of I(r,p) over all r is equal to the supremum of I(r,p) over all strictly positive r; that is, the maximum over all input distributions can be replaced by the supremum over all strictly positive input distributions.

Let r* achieve the above maximization. If r* is strictly positive, then in C, the maximization over all input distributions r can be replaced by the maximization over all strictly positive input distributions r, which in turn can be replaced by the supremum over all strictly positive distributions r. Next, consider the general case in which r* may not be strictly positive, that is, there may exist zero probability masses in r*.
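To make this general case concrete, here is a small numerical sketch (the channel and the distributions are assumed purely for illustration): a channel whose capacity-achieving input r* has a zero probability mass, approached by strictly positive input distributions.

```python
import numpy as np

def I(r, p):
    """Mutual information I(r, p) in nats."""
    joint = r[:, None] * p
    w = joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(r, w)[mask]))

# Assumed channel: inputs 0 and 2 are noiseless, input 1 gives a uniform
# output and is useless. The capacity-achieving r* = (0.5, 0, 0.5) lies on
# the boundary of the probability simplex.
p = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
r_star = np.array([0.5, 0.0, 0.5])
C = I(r_star, p)                          # log 2 nats

# Strictly positive distributions approaching r* recover C in the limit.
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    r_tilde = (1 - t) * r_star + t * np.ones(3) / 3
    print(t, C - I(r_tilde, p))           # the gap shrinks toward 0
```

Each r_tilde here plays the role of the strictly positive distribution within distance delta of r* in the proof that follows.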
Since I(r,p) is continuous in r, for any epsilon > 0 there exists delta > 0 such that if the Euclidean distance between r and r* is less than delta, then the gap between C and I(r,p) is less than epsilon. In particular, there exists a strictly positive tilde{r} such that the Euclidean distance between tilde{r} and r* is less than delta. This is illustrated in the figure below: r* is on the boundary of the probability simplex, that is, there are some zero probability masses in r*. This is the case of interest, because the other case, in which r* is strictly positive, has already been treated. Now consider a ball of radius delta centered at r*, and let tilde{r} be a strictly positive distribution within this ball.

Following the definition of C, we have C equal to the maximum of I(r,p) over all input distributions r, which is greater than or equal to the supremum of I(r,p) over all strictly positive r. Since tilde{r} is strictly positive, this in turn is greater than or equal to I(tilde{r},p). As tilde{r} is within distance delta of r*, we have C - I(tilde{r},p) < epsilon, or I(tilde{r},p) > C - epsilon. In the above chain of inequalities, letting epsilon go to 0, the epsilon term vanishes, and we conclude that the supremum of I(r,p) over all strictly positive r is indeed equal to C. This completes the proof of the theorem.
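Combining Lemma 9.1 with the general alternating optimization described at the start of the section gives the Blahut–Arimoto algorithm for capacity. The following is a minimal sketch, assuming natural logarithms (capacity in nats); the closed form used in the r-step, r(x) proportional to exp(sum over y of p(y|x) log q(x|y)), is the standard update, and the binary symmetric channel is used only as a sanity check.

```python
import numpy as np

def blahut_arimoto(p, num_iter=200):
    """Channel capacity (in nats) of a DMC p(y|x) by alternating optimization.

    Alternates the two component maximizations of the double summation:
      q-step (Lemma 9.1):  q(x|y) = r(x) p(y|x) / sum_x' r(x') p(y|x')
      r-step:              r(x) proportional to exp( sum_y p(y|x) log q(x|y) )
    """
    nx, ny = p.shape
    r = np.full(nx, 1.0 / nx)             # strictly positive starting point
    for _ in range(num_iter):
        joint = r[:, None] * p
        q = joint / joint.sum(axis=0)     # q-step
        # r-step; mask zero-probability transitions so that 0 * log 0 = 0
        log_q = np.where(p > 0, np.log(np.where(q > 0, q, 1.0)), 0.0)
        t = np.exp((p * log_q).sum(axis=1))
        r = t / t.sum()
    # Capacity estimate: I(r, p) at the final input distribution.
    joint = r[:, None] * p
    w = joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(r, w)[mask]))

# Sanity check on a binary symmetric channel with crossover probability 0.1,
# whose capacity is known in closed form: log 2 - H(0.1) nats.
eps = 0.1
p = np.array([[1 - eps, eps],
              [eps, 1 - eps]])
C = blahut_arimoto(p)
H = -eps * np.log(eps) - (1 - eps) * np.log(1 - eps)
print(C, np.log(2) - H)                   # the two values agree closely
```

Each iteration fixes one variable and maximizes over the other, so the double summation is nondecreasing along the iterations, which is what drives the convergence of the algorithm.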