Hi, I'm Mark.

I'm a lecturer in statistical machine learning at Imperial College.

In this course, we are going through the mathematical foundations that we need

in order to do dimensionality reduction with principal component analysis.

Data in real life is often high dimensional.

For example, if we want to estimate the price of our house in a year's time,

we can use data that helps us to do this.

The type of the house, the size,

the number of bedrooms and bathrooms,

the value of houses in the neighborhood,

when they were bought,

the distance to the next train station and park,

the number of crimes committed in the neighborhood,

the economic climate and so on and so forth.

There are many things that influence the house price, and we can collect this information

in a data set that we can use to estimate the price of our house.

Another example is a 640 by 480 pixel

color image, which is a data point in a 921,600 dimensional space,

where every pixel corresponds to three dimensions,

one for each color channel:

red, green and blue.

Working with high dimensional data comes with some difficulties.

It is hard to analyse, interpretation is difficult,

visualisation is nearly impossible and from a practical point of view,

storage can be quite expensive.

However, high dimensional data also has some nice properties.

For example, high dimensional data is often overcomplete, which means that

many dimensions are redundant and can be explained by a combination of other dimensions.

For example, if we took a color image with four channels, red, green,

blue, and a gray-scale channel, then the gray-scale channel can be explained by

a combination of the other three channels

and the image could equally be represented by red,

green and blue alone as we can reconstruct

the gray-scale channel just using that information.
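
That redundancy is easy to check in a few lines of NumPy. The specific luminance weights below (the common ITU-R BT.601 coefficients) are an assumption; the lecture only says gray is some combination of red, green, and blue.

```python
import numpy as np

# Common luminance weights (ITU-R BT.601) -- an assumption, since the
# lecture only says gray is *some* combination of red, green, and blue.
w = np.array([0.299, 0.587, 0.114])

rng = np.random.default_rng(0)
rgb = rng.random((480, 640, 3))        # a random 640 x 480 color image
gray = rgb @ w                         # gray channel derived from R, G, B
rgba_gray = np.dstack([rgb, gray])     # redundant four-channel image

# Drop the gray channel and reconstruct it from the other three alone:
reconstructed = rgba_gray[..., :3] @ w
print(np.allclose(rgba_gray[..., 3], reconstructed))  # True
```

Since the fourth channel is an exact linear combination of the first three, storing it adds no information.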

Dimensionality reduction exploits structure and correlation and allows us to work with

a more compact representation of the data ideally without losing information.

We can think of dimensionality reduction as

a compression technique similar to JPEG or MP3,

which are compression algorithms for images and music.

Let's take a small binary image of the handwritten digit eight.

We're looking at 28 by 28 pixels,

each of which can be either black or white.

Therefore, this image can be represented as a 784 dimensional vector.

However, in this example,

the pixels are not randomly black or white, they are structured.

They often have the same value as the neighboring pixels.
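
The flattening step can be sketched in a few lines of NumPy; the image here is a random binary stand-in, since we don't have the actual digit from the lecture.

```python
import numpy as np

# A toy stand-in for the 28 x 28 binary image of a handwritten eight
# (random here, since we don't have the actual image from the lecture).
rng = np.random.default_rng(1)
image = (rng.random((28, 28)) > 0.5).astype(np.uint8)

# Stacking the rows gives the 784 dimensional vector representation.
vector = image.reshape(-1)
print(vector.shape)  # (784,)

# Nothing is lost: undoing the reshape recovers the image exactly.
restored = vector.reshape(28, 28)
print(np.array_equal(restored, image))  # True
```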

In this data set,

there are many examples of an eight.

They differ a little bit but they look

sufficiently similar that we can identify them as eight.

We can use dimensionality reduction now to find a lower dimensional representation of

all eights that is easier to work with than a 784 dimensional vector.

The lower dimensional representation of

a high dimensional data point is often called a feature or a code.

In this course, we will look at

a classical algorithm for linear dimensionality reduction,

principal component analysis, or PCA.

The purpose of this course is to go through

the necessary mathematical details to derive PCA.

In the first module, we will look at statistical representations of data,

for example, using means and variances.

We will also describe how means and variances

change if we linearly transform our data set.
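
As a preview, the transformation rules covered there, mean(Ax + b) = A mean(x) + b and cov(Ax + b) = A cov(x) A^T, can be checked empirically with NumPy; the particular matrix A and offset b below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 3))      # data set: 10,000 points in R^3

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])      # a linear map from R^3 to R^2
b = np.array([5.0, -3.0])             # an offset

Y = X @ A.T + b                       # transform every point: y = A x + b

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# mean(A x + b) = A mean(x) + b  and  cov(A x + b) = A cov(x) A^T
print(np.allclose(Y.mean(axis=0), A @ mu + b))                # True
print(np.allclose(np.cov(Y, rowvar=False), A @ Sigma @ A.T))  # True
```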

In the second module,

we will look at the geometry in vector spaces and

define how to compute distances and angles between vectors.
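
Using the standard dot product as the inner product, those computations look like this in NumPy; the two vectors are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

# Distance: the norm of the difference vector.
distance = np.linalg.norm(x - y)

# Angle: via the dot product, cos(angle) = <x, y> / (|x| |y|).
cos_angle = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
angle_deg = np.degrees(np.arccos(cos_angle))

print(distance)   # 1.0
print(angle_deg)  # approximately 45.0
```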

In the third module,

we'll use these results to project data onto

lower dimensional subspaces and in the fourth module,

we'll derive PCA as a way to reduce dimensionality of data using orthogonal projections.
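
To preview where the course is heading, here is a minimal PCA sketch along those lines, assuming NumPy and a toy two-dimensional data set: center the data, eigendecompose the covariance matrix, and orthogonally project onto the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated 2D toy data: the second coordinate is roughly twice the first.
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

# PCA: center the data and eigendecompose the covariance matrix.
mean = X.mean(axis=0)
Xc = X - mean
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

# The first principal component is the top eigenvector.
b1 = eigvecs[:, -1]

# One-dimensional codes, and the orthogonal projection back into 2D.
codes = Xc @ b1
X_approx = np.outer(codes, b1) + mean

# The 1D representation keeps almost all of the variance.
explained = eigvals[-1] / eigvals.sum()
print(explained > 0.95)  # True
```

Because the two coordinates are strongly correlated, a single direction captures nearly all the variance, which is exactly the structure PCA exploits.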