0:00
So one issue with both the SVD and principal components analysis is missing values.
Real data will typically have missing values, and the problem is that if you try to run the SVD
on a data set that has some missing values, like the one you see that I've created here,
you get an error.
You just can't run it on a data set that has missing values.
So you need to do something about the missing values
before you run an SVD or a PCA.
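As a sketch of that failure in R, assuming a small random matrix (the names dataMatrix and dataMatrix2 are just illustrative placeholders):

set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)    # a complete 40 x 10 data matrix
dataMatrix2 <- dataMatrix
dataMatrix2[sample(1:400, size = 40)] <- NA    # knock out 40 values at random
svd(dataMatrix2)
## Error: infinite or missing values in 'x'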
0:26
So, one possibility, and there are many others, is to
use the impute package, which is available from the Bioconductor project,
and just impute the missing data points so that
you have a value everywhere, and then you can run your SVD.
The code here uses the impute.knn function, which takes the
missing values in a row and imputes them using the k nearest neighbors to
that row.
So if k, for example, is five, then it will
take the five rows that are closest to the row with
the missing data, and then fill in the missing values with
an average of those five rows.
And so, once we've imputed the data with
this impute.knn function, we can run the SVD.
You can see, it runs without error.
And then we can plot the first singular
vector from each of the two decompositions.
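A minimal sketch of this step, continuing with the hypothetical dataMatrix and dataMatrix2 objects from above:

library(impute)                                     # available from Bioconductor
dataMatrix2 <- impute.knn(dataMatrix2, k = 5)$data  # fill NAs from 5 nearest rows
svd1 <- svd(dataMatrix)     # SVD of the original, complete matrix
svd2 <- svd(dataMatrix2)    # SVD of the imputed matrix -- no error now
par(mfrow = c(1, 2))
plot(svd1$v[, 1], pch = 19, main = "Original")      # first right singular vector
plot(svd2$v[, 1], pch = 19, main = "Imputed")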
So on the left-hand side, I've got the
first singular vector from the original data matrix,
and on the right-hand side,
I've got the first singular vector from the
data matrix where the missing data was imputed.
Now, you can see that they're roughly similar.
They're not exactly the same, but the imputation
didn't seem to have a major effect on
the running of the SVD.
1:46
So this final example here is just kind of an interesting one.
I want to show how you can take an
actual image, which is represented as a matrix, and
develop a lower dimensional, or lower rank,
representation of that image.
So here's a picture of a face.
It's a relatively low resolution picture of a face, but you can see that there
is a nose, two ears, two eyes, and a mouth there.
And so what we're going to do is run the SVD
on this face data and look at the variance explained.
You can see that the first
singular vector explains about 40% of the variation,
the second about 20-some percent,
and the third maybe 15%.
And if you look at, say, the first five to ten singular
vectors, they capture pretty much all of the variation in the data set.
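As a sketch of this step, assuming the image is stored in a numeric matrix called faceData (a hypothetical name for illustration):

svd1 <- svd(faceData)   # SVD of the image matrix
plot(svd1$d^2 / sum(svd1$d^2), pch = 19,
     xlab = "Singular vector", ylab = "Proportion of variance explained")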
And so we can actually
look at the image that's generated by, say, just the first
singular vector, or the first five, or the first ten.
2:53
That is, an image that uses fewer components than the original data set.
So here I'm creating one approximation that uses just
the first principal component, the first singular vector.
Another one takes the first five altogether.
And then another one takes the first ten.
And so we can take a look at what these approximations look like.
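A minimal sketch of building those rank-1, rank-5, and rank-10 approximations from the svd1 object above (%*% is R's matrix multiplication):

approx1  <- svd1$u[, 1] %*% t(svd1$v[, 1]) * svd1$d[1]                   # rank 1
approx5  <- svd1$u[, 1:5]  %*% diag(svd1$d[1:5])  %*% t(svd1$v[, 1:5])   # rank 5
approx10 <- svd1$u[, 1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[, 1:10])  # rank 10
par(mfrow = c(1, 4))
image(t(approx1)[, nrow(approx1):1], main = "(a)")
image(t(approx5)[, nrow(approx5):1], main = "(b)")
image(t(approx10)[, nrow(approx10):1], main = "(c)")
image(t(faceData)[, nrow(faceData):1], main = "(d) original")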
So the first image here, all the way
on the left, uses just a single singular vector.
And you can see that it's not a very good picture, so to speak.
There's not really a face there.
There's not much you can see.
But it's asking a lot to represent an entire image using just a single vector.
So, if we move on to the second one from the left, you
can see that basically most of the key features are already there.
This uses the first five singular vectors.
And you can see that clearly there's a face:
two eyes, a nose, a mouth, and two ears.
3:49
If you move on to the next picture, which is
letter (c) here, you can see that it
has a little bit more definition.
This one uses the first ten singular vectors.
But it's not very different from the second one, which only used five.
And then the very last one here on the right is the original data set.
So, you can see that if you use
just a few singular vectors, maybe up to five or ten,
you can get a reasonable approximation of this face without
having to store all of the original data.
So, this is an example
of the kind of data compression
that the singular value decomposition can provide.
Now, data compression and statistical summaries
are kind of two sides of the same coin.
And so if you want to summarize a data set
with a smaller number of features, the
singular value decomposition is also useful for that.
4:44
So, just a couple of notes and further resources
for the singular value decomposition and principal components analysis.
One of the issues is that the scale of your data matters.
For example, it's common to
measure lots of different variables that come on different scales.
And that can cause a problem, because if one
variable is much larger than another variable, just because
its units are so different, that variable will tend to drive
the principal components and the singular vectors.
And that may not be particularly meaningful to you.
So you want to check that the
scales of the different columns or rows are roughly comparable to each other.
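A common fix, sketched here for the hypothetical dataMatrix from earlier, is to center and scale each column before decomposing:

# scale() centers each column to mean 0 and scales it to standard deviation 1,
# so no single variable dominates just because of its units
svd1 <- svd(scale(dataMatrix))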
5:21
And, as we saw in the example with the
two different patterns, the principal components and
the singular vectors may mix real patterns together.
And so the patterns that you see may not
represent separable underlying patterns; they may be
patterns that are mixed together.
The singular value decomposition can be computationally intensive if you have a
very large matrix, so that's something to keep in mind.
We used relatively small matrices here, but
of course computing power is getting ever more
5:48
powerful, and there are some highly optimized and
specialized matrix libraries out there
for computing the singular value decomposition.
And so this can be done on lots of practical problems.
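One such specialized option in R, named here as an assumption since the lecture doesn't single any library out, is the irlba package, which computes just the first few singular vectors of a large matrix rather than the full decomposition:

library(irlba)                        # fast truncated SVD for large matrices
bigMatrix <- matrix(rnorm(1e6), nrow = 1000)
svdTop5 <- irlba(bigMatrix, nv = 5)   # only the first 5 singular vectors/values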
6:04
And so here are a couple of links to further resources
on how to use principal
components analysis and the singular value decomposition.
There are also other kinds
of approaches that are similar to these,
but are different in many of the details.
You may hear about these approaches:
things like factor analysis, independent
components analysis, and latent semantic analysis.
These are worth exploring, and they're related to the basic
idea behind principal
components analysis and the singular value decomposition,
which is that you want to find a
lower dimensional representation that explains most of the variation
in the data that you see.