So, this means that Xn tilde is
the orthogonal projection of Xn onto the subspace spanned by the M basis vectors,
bj where j = 1 to M. Similarly,
we can write Xn as the sum j = 1 to M of bj times bj transpose times Xn,
plus a second sum that runs from j = M + 1 to D of
bj times bj transpose times Xn.
So, we write Xn as a projection onto
the principal subspace plus a projection onto the orthogonal complement.
And this second sum is the term that is missing over here.
That's the reason why Xn tilde is only an approximation to Xn.
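In symbols, the decomposition just described reads as follows (a sketch of the spoken formulas, writing Xn tilde as \tilde{x}_n):

\tilde{x}_n = \sum_{j=1}^{M} b_j b_j^\top x_n,
\qquad
x_n = \sum_{j=1}^{M} b_j b_j^\top x_n + \sum_{j=M+1}^{D} b_j b_j^\top x_n .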
So if we now look at the difference vector between Xn tilde and Xn,
what remains is exactly this term.
So Xn minus Xn tilde is
the sum j = M + 1 to D of
bj times bj transpose times Xn.
So, now we can look at these displacement vectors,
the differences between Xn and its projection,
and we can see that
each displacement vector lies exclusively in the subspace that we ignore,
that is, the orthogonal complement of the principal subspace.
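Written out, the displacement vector from above is

x_n - \tilde{x}_n = \sum_{j=M+1}^{D} b_j b_j^\top x_n ,

and every term in this sum lies in the span of b_{M+1}, \dots, b_D, which is exactly the orthogonal complement of the principal subspace.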
Let's look at an example in two dimensions.
We have a data set in two dimensions, represented by these dots,
and now we are interested in projecting it onto the U1 subspace.
If we do this and then look at
the difference vectors between the original data and the projected data,
we get these vertical lines.
That means they have no x-component,
no variation in x.
That means they only have a component that lives in the subspace U2, which
is the orthogonal complement of U1, the subspace that we projected onto.
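Here is a small numerical sketch of this two-dimensional picture; the data set and the choice of U1 as the horizontal axis are illustrative assumptions, not the lecture's actual example.

import numpy as np

# Illustrative centred 2D data set (an assumption, not the lecture's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2)) * np.array([2.0, 0.5])
X = X - X.mean(axis=0)

b1 = np.array([1.0, 0.0])  # spans U1, the subspace we project onto
b2 = np.array([0.0, 1.0])  # spans U2, the orthogonal complement

# Orthogonal projection of every data point onto U1: b1 b1^T x_n
X_tilde = np.outer(X @ b1, b1)

# Difference vectors x_n - x_tilde_n: the "vertical lines" in the picture
diff = X - X_tilde
print(np.allclose(diff[:, 0], 0.0))             # True: no x-component
print(np.allclose(diff, np.outer(X @ b2, b2)))  # True: diff lives entirely in U2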
So, with this illustration in mind,
let's quickly rewrite this in a slightly different way.
We're going to write this as the sum j = M + 1 to D of
bj transpose Xn times bj, and we're going to call
this equation E. We looked at
the displacement vector between Xn and its
orthogonal projection onto the principal subspace, Xn tilde.
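In symbols, equation E is just a transcription of the spoken formula:

x_n - \tilde{x}_n = \sum_{j=M+1}^{D} (b_j^\top x_n)\, b_j . \qquad (E)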
And now we're going to use this to reformulate our loss function.
So, from equation B,
we get that our loss function is 1 over N times
the sum n = 1 to N of Xn minus Xn tilde squared.
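Writing the loss as \mathcal{L} (this symbol is a label chosen here, not one fixed in the lecture), the formula just stated is

\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \tilde{x}_n \rVert^2 .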
So, this is the average squared reconstruction error and now we're
going to use equation E for the displacement vector here.
So we rewrite this now using equation E as 1 over N times the sum n
= 1 to capital N, and inside that squared norm we're going
to use this expression here.
So we get the squared norm of the sum j = M + 1 to
D of bj transpose times Xn times bj.
And now we're going to use the fact that the bjs form an
orthonormal basis, and this will greatly simplify the expression:
we will get 1 over N times the sum n = 1 to capital N of
the sum j = M + 1 to D of bj transpose times Xn, squared.
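As a sketch of this step: substituting equation E and using the orthonormality of the bj, meaning b_i^\top b_j = 1 if i = j and 0 otherwise, gives

\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \Big\lVert \sum_{j=M+1}^{D} (b_j^\top x_n)\, b_j \Big\rVert^2
= \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^\top x_n)^2 .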
And now we're going to multiply this out
explicitly, and we get 1 over N times the sum over
n of the sum over j of
bj transpose times Xn times Xn transpose times bj.
This works because bj transpose times Xn is a scalar,
so it is identical to its transpose, Xn transpose times bj.
And now I'm going to rearrange the sums.
I'm going to move the sum over j outside.
So I'll have the sum over
j = M + 1 to D of bj transpose,
which is independent of n,
times 1 over N times the sum
n = 1 to N of Xn times Xn transpose,
and then times the bj from here.
So I'm going to bracket it now in this way.
And what we can see now is that if we look very carefully,
we can identify this expression as the data covariance matrix S,
because we assumed we have centred data.
So the mean of the data is zero.
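Sketched symbolically, this rearrangement is

\frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j
= \sum_{j=M+1}^{D} b_j^\top \Big( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \Big) b_j ,

and for centred data the bracketed term is the data covariance matrix S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top.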
This means we can now rewrite our loss function using the data covariance matrix,
and we get that our loss is the sum over j = M + 1 to D of bj
transpose times S times bj, and we can
also get a slightly different interpretation
by rearranging a few terms and using the trace operator.
So, we can now also write this as the trace of the sum over j = M + 1 to D
of bj times bj
transpose, multiplied by S, and we
can now also interpret this matrix, the sum of the bj times bj transpose, as a projection matrix.
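Putting these last two formulations side by side (a transcription of what was just said; the second equality uses the identity trace(AB) = trace(BA)):

\mathcal{L} = \sum_{j=M+1}^{D} b_j^\top S\, b_j
= \mathrm{trace}\Big( \Big( \sum_{j=M+1}^{D} b_j b_j^\top \Big) S \Big),

where the matrix \sum_{j=M+1}^{D} b_j b_j^\top projects onto the orthogonal complement of the principal subspace.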