So I will start with a generic model for multimedia processing. [COUGH] Before we get to that, you would have already realized that communication of data is very expensive, whether we use a dedicated circuit-switched network or the internet. We saw in the previous course that wideband speech can take up to 224 kbps. For CD-quality audio, with the sampling rate and the bits per sample needed for the increased dynamic range, we need about 1.4 megabits per second. Usually the payload rates are higher when you include the overhead, like [COUGH] forward error correction, markers for random access in the media, and so on. So let's just look at the content itself in the payload, just so that we can keep things apples to apples. For still images, there are three components: the red, blue, and green streams that come from typical camera sensors. Each of them requires eight bits per pixel. So, if you have a 1024 by 1024 image, we need about 3.14 megabytes for each image. In the case of video, we have similar images coming from the sensor at 24 frames per second, 30 frames per second, or even 60 frames per second in some of the new handsets. So you can see that the data rates will very quickly reach gigabits per second. These rates are not really compatible with internet services, and we need to compress the data for transmission, storage, and retrieval.

So here is a generic block diagram that does this. [COUGH] It leverages some of the tools we already discussed in the previous course. For now, ignore the yellow blocks and follow the signal from source to destination. The time-frequency decomposition block pushes the signal into a space where we can take advantage of known properties of the signal. The inverse time-frequency decomposition typically results, if you do things correctly, in perfect reconstruction of the original signal, assuming we do not have any of the intermediate processes that are shown in the block diagram. In the case of voice, this was LPC analysis, where we took advantage of the known properties of the source, that is, the human speech production system. This is not so in the case of image and video processing systems, as we will see shortly. In the case of audio, we actually take advantage of known properties of the sink, that is, the human auditory system: what exactly do we listen to when we are listening to a CD? So the time-frequency block itself does not give us any data rate reduction. It is the quantization block that gives us most of the reduction in the data rate. The dotted feedback path in the receiver represents techniques such as the backward adaptation we saw in waveform coding, for example G.726 in course four. We also saw in the previous course that the patterns that occur in text, or in the processed data that comes out of the quantizer, do not all have uniform distributions, just as in English text. So entropy coding provides additional reductions in the data rate by taking advantage of these non-uniform distributions.

The RTP block provides for real-time transmission over the internet or for storage and retrieval. There are similar protocols such as TCP/IP, HTTP streaming, etc. The feedback path provided by RTCP around the communication block is very, very useful, as we saw in the voice case. This becomes even more important for the video streaming case, because every short channel error has a significant impact on the video quality. Now, let's take a look at the yellow blocks on the sender side.
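Before we get to the yellow blocks, here is a minimal Python sketch that reproduces the raw, uncompressed rates quoted at the start of this segment. The wideband-speech parameters (16 kHz sampling, 14 bits per sample) are my assumption chosen to match the 224 kbps figure; the image and video numbers follow directly from the values above.

```python
# Back-of-the-envelope check of the raw (uncompressed) data rates.

# Wideband speech: assumed 16 kHz sampling, 14 bits per sample, mono
speech_bps = 16_000 * 14
print(f"Wideband speech: {speech_bps / 1e3:.0f} kbps")        # 224 kbps

# CD-quality audio: 44.1 kHz sampling, 16 bits per sample, 2 channels
cd_bps = 44_100 * 16 * 2
print(f"CD audio: {cd_bps / 1e6:.2f} Mbps")                   # ~1.41 Mbps

# Still image: 1024 x 1024 pixels, 3 components (R, G, B), 8 bits each
image_bytes = 1024 * 1024 * 3
print(f"Still image: {image_bytes / 1e6:.2f} MB per image")   # ~3.1 MB

# Raw video: the same frames at 24, 30, or 60 frames per second
for fps in (24, 30, 60):
    print(f"Video at {fps} fps: {image_bytes * 8 * fps / 1e9:.2f} Gbps")
```

At 60 frames per second this already exceeds 1.5 Gbps, which is why compression is unavoidable for transmission, storage, and retrieval.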
When there is temporal, spatial, or spectral redundancy in the signals, we can actually predict what the next frame or the next sample is going to be, and then we can subtract this prediction from the incoming data to reduce the bit rate further. This is what we did in the CELP case, that is, code-excited linear prediction, and there are similar things we can do for the other types of signals.

So just to recap, here is the spirit of the processing we do at the encoder. The time-frequency decomposition puts the signal in a different space, so that we can take advantage of the redundancy in the signals and ignore the irrelevant parts. These are the two operative words: redundancy and irrelevancy, using our prior knowledge of the signals. For redundancy, think of a pure tone. It has all the samples in the time domain, but you need only two samples to represent it very compactly in the frequency domain. Similarly, if you have short transients in the time domain, you get a very wide spectral distribution. [COUGH] So you can have a very compact representation in the time domain for transients and a very compact representation in the frequency domain for tonal sounds. [COUGH] For the irrelevancy part, the more you know about the signals, the source, and the sink, the more irrelevant parts we can identify and take advantage of in this process. And as we discussed before, the quantization step implements a lossy bit rate reduction scheme: we simply assign fewer bits to the parts that are less relevant, based on what we already know. Then comes the entropy coding. There are many types of entropy coding systems that we can leverage in the pipeline, but for now I want you to recognize that entropy coding is a lossless compression and quantization is a lossy compression. [COUGH]

So let's look at some of the options we have for the time-frequency decomposition. We already saw LPC in the previous course. Other than that, the short-time Fourier transform is the basic workhorse, because it doesn't make any assumption about the source or the sink. The picture at the bottom left shows the tiling of components in the STFT: the resolution in the time domain along the x axis [COUGH] and in the frequency domain along the y axis. The magnitude in each of the tiles in this representation is shown as the intensity, the same as in the spectrogram pictures we saw before. The other common technique that is used a lot in visual processing is the Gabor transform, proposed by Dennis Gabor in the late 40s. It is very similar to the STFT, except it uses a Gaussian window in the time domain before taking the FFT of each of the short segments. Recall that if we use short windows in the STFT, we have very good temporal resolution, but this results in very poor spectral resolution. On the other hand, if we want very fine spectral resolution, we need very long time-domain segments, and this ends up with very poor temporal resolution. This is actually related to Heisenberg's uncertainty principle as applied to signal processing. In the Gabor transform, because we use a Gaussian window in the time domain, the window is also Gaussian in the frequency domain, so we have the most compact representation in this class of time-frequency representations. A small numerical sketch of this trade-off follows at the end of this segment. The next one that you'll come across a lot in JPEG, JPEG 2000, and other video processing is the wavelet transform. It uses different basis functions, as opposed to the short-time Fourier or Gabor transforms.
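To make the window-length trade-off concrete, here is a minimal numpy sketch (not from the lecture) of a Gaussian-windowed STFT, that is, a Gabor-style analysis. The sampling rate, tone frequencies, and window lengths are arbitrary choices for illustration.

```python
# Time/frequency resolution trade-off of a Gaussian-windowed STFT.
import numpy as np

fs = 8000                                     # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1100 * t)
x[4000] += 5.0                                # a short transient ("click") at 0.5 s

def gabor_stft(x, win_len, hop):
    """Magnitude STFT using a Gaussian analysis window."""
    n_frames = (len(x) - win_len) // hop + 1
    n = np.arange(win_len)
    w = np.exp(-0.5 * ((n - win_len / 2) / (win_len / 6)) ** 2)   # Gaussian window
    frames = np.stack([x[i * hop : i * hop + win_len] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))    # shape: (frames, frequency bins)

# Long window (512 samples = 64 ms): fine spectral resolution, so the two tones
# 100 Hz apart are resolved, but the click is smeared over a whole 64 ms frame.
S_long = gabor_stft(x, win_len=512, hop=256)

# Short window (64 samples = 8 ms): fine temporal resolution, so the click is
# well localized, but the two tones merge into one broad spectral blob.
S_short = gabor_stft(x, win_len=64, hop=32)

print(S_long.shape, S_short.shape)
```

Inspecting the two magnitude arrays shows exactly the trade-off discussed above: the long-window analysis separates the two tones but not the click in time, and the short-window analysis does the opposite.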
Coming back to the wavelet transform: instead of exponential basis functions with a constant window length everywhere in the tiling, we use much shorter windows in the time domain for the higher frequencies. This results in very fine spectral resolution at the lower frequencies and very fine temporal resolution at the higher frequencies. In that sense, it is very similar to the human perception of audio and video signals. I just realized that the figure on the bottom right does not quite show the varying frequency resolution on the y axis; I should remember to redraw this in the future. I do encourage you to take a look at the links here, at least very briefly, to see if these are topics you want to explore in the future, because each one of them is very big and can be a complete course or a lesson by itself.

For the entropy coding, I put two links up here for you to do some further digging. [COUGH] The Huffman coding we talked about in course four with respect to English text. It is essentially a variable-length code, and it is computationally extremely simple, so it has been used in many, many practical implementations from the days when processing power on mobile platforms was still at a premium. So, a simple example. Let's say we have four symbols, a1 through a4, with probabilities of occurrence in some given corpus, sorted in descending order. You need a large representative database to generate these probabilities for all the symbols in real systems, but let's just look at this small example. In the process of building the Huffman tables, what happens is that a1 gets assigned the shortest code word; it is 0 in this case. [COUGH] Huffman is also a prefix code. This means that the code words for a2, a3, and all the rest shall not start with the code word for a1; no code word is a prefix of another. So as you go to lower-probability symbols, we have to find longer code words that do not have the earlier code words as a prefix.
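To make this concrete, here is a small Python sketch of the Huffman construction for the four symbols a1 through a4. The probabilities are made up for illustration, since the exact values from the slide are not reproduced here; the code simply merges the two least probable entries at every step.

```python
# Huffman code construction by repeatedly merging the two least probable nodes.
import heapq
import itertools

probs = {"a1": 0.5, "a2": 0.25, "a3": 0.15, "a4": 0.10}   # assumed values

counter = itertools.count()              # tie-breaker so equal probabilities compare cleanly
heap = [(p, next(counter), {s: ""}) for s, p in probs.items()]
heapq.heapify(heap)

while len(heap) > 1:
    p1, _, codes1 = heapq.heappop(heap)  # least probable subtree
    p2, _, codes2 = heapq.heappop(heap)  # next least probable subtree
    # Prepend a bit: '0' for one branch, '1' for the other.
    merged = {s: "0" + c for s, c in codes1.items()}
    merged.update({s: "1" + c for s, c in codes2.items()})
    heapq.heappush(heap, (p1 + p2, next(counter), merged))

codes = heap[0][2]
for sym in sorted(codes):
    print(sym, codes[sym])
```

With these assumed probabilities, a1 ends up with the single-bit code word 0 and a4 with the longest code word, and no code word is a prefix of another, which is exactly the prefix property described above.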