In the last video of the last week,
we talked about Gaussian mixture models and how
they can be interpreted as models with a hidden state that gives the component which generates a given data point.
In this case, model estimation amounts to estimating both the hidden state s and the means and variances of all Gaussian components.
We also said that this problem can be solved using
the EM algorithm that iterates between two steps.
The E step estimates the hidden variables given the observed variables.
The M step maximizes a lower bound on the likelihood of the data by tuning all parameters in the model.
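As a minimal sketch of what this looks like in code, assuming scikit-learn is available (its GaussianMixture estimator runs this EM iteration internally), fitting a two-component mixture to toy data could look like this; all numbers are purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy one-dimensional data drawn from two Gaussian components
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(1.5, 1.0, 700)]).reshape(-1, 1)

# fit() alternates E and M steps until the log likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

print(gmm.means_.ravel())        # estimated component means
print(gmm.covariances_.ravel())  # estimated component variances
print(gmm.predict_proba(x[:5]))  # posterior over the hidden component (E-step output)
```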
Now, I want to walk you through a few other latent variable models,
which will eventually pave the way to
our discussion of reinforcement learning in the second lesson of this week.
First, let's talk about factor analysis.
Let's assume we have data in the form of T observations of an N-dimensional vector y.
Factor analysis seeks a decomposition of the signal y into a weighted sum of some hidden, or latent, variables x, which are assumed to be Gaussian with zero mean and unit variance, plus an N-dimensional white noise epsilon that has a diagonal covariance matrix psi.
The equation reads y equals lambda x plus epsilon, where lambda is a matrix of size N by K and x is a K-dimensional vector.
Lambda is called the factor loading matrix as it gives weights
of different factors or components of X in the final observed value.
Now, because both x and epsilon are Gaussian, y will also be Gaussian with zero mean, and its covariance can be expressed as lambda times lambda transposed plus psi.
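To make this concrete, here is a small simulation sketch, assuming numpy, with all dimensions and values chosen purely for illustration; it checks that data generated as y = lambda x + epsilon indeed has covariance lambda lambda transposed plus psi:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, T = 5, 2, 100_000                        # observed dim, latent dim, sample size (toy values)

Lambda = rng.normal(size=(N, K))               # factor loading matrix (hypothetical values)
psi = np.diag(rng.uniform(0.1, 0.5, size=N))   # diagonal noise covariance

x = rng.normal(size=(T, K))                                # latent factors ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(N), psi, size=T)    # white noise with covariance psi
y = x @ Lambda.T + eps                                     # y = Lambda x + eps

# The sample covariance of y should be close to Lambda Lambda^T + psi
print(np.round(np.cov(y.T) - (Lambda @ Lambda.T + psi), 2))
```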
Now, what is the point of such a replacement of one Gaussian variable by a sum of two such variables?
Well, one reason for this is that such a decomposition of y provides a compact set of model parameters.
Indeed, the number of free parameters needed to describe a covariance matrix of a general N-dimensional vector y is N times N plus 1 divided by 2.
But the factor model needs only N times K parameters for the matrix lambda, plus N more parameters to describe the variances of the components of epsilon.
This makes N times (K + 1) parameters, which is much less than the first count.
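For example, with N = 100 series and K = 5 factors (numbers picked only for illustration), the two counts compare as follows:

```python
N, K = 100, 5                          # illustrative dimensions
full_cov_params = N * (N + 1) // 2     # free parameters of a general N x N covariance matrix
factor_params = N * K + N              # N*K loadings plus N noise variances, i.e. N*(K + 1)
print(full_cov_params, factor_params)  # 5050 versus 600
```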
Also, we can compute the distribution of
hidden factors in factor analysis conditional on the observed data.
If we take a low number of factors, this can also provide a low-dimensional representation of your data.
There are also some subtle points about factor analysis.
The first point is that as it stands,
the factor model does not give a unique solution for lambda and x for a given y.
To see this, let's assume that given values of lambda and x are such that they provide the best fit to the data y.
Now, assume that we have an arbitrary orthogonal matrix U, so that U times U transposed equals a unit matrix.
Because U times U transposed equals the identity matrix, we can insert this product right in between lambda and x in this equation.
But then, we can rename the product lambda times U as lambda hat, and also rename the product U transposed times x as x hat.
So, we get that the factor decomposition has the same form as above, but with different parameters lambda hat and x hat.
It means that lambda and x are not unique for a given y.
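A quick numerical check of this non-uniqueness, again just a sketch assuming numpy, shows that rotating the loadings by any orthogonal matrix U leaves the implied covariance of y, and therefore the fit, unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 5, 2
Lambda = rng.normal(size=(N, K))        # some loading matrix (illustrative)

# Build a random orthogonal K x K matrix via a QR decomposition
U, _ = np.linalg.qr(rng.normal(size=(K, K)))

Lambda_hat = Lambda @ U                 # rotated loadings

# The implied covariance Lambda Lambda^T is unchanged by the rotation
print(np.allclose(Lambda @ Lambda.T, Lambda_hat @ Lambda_hat.T))  # True
```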
To resolve this, some constraints need to be added to the definition of the factor model.
Usually, the factor loading matrix is constrained to be orthogonal.
One motivation for this is provided by a link
that exists between factor analysis and the PCA.
Namely, if we take the noise covariance matrix to be proportional to a unit matrix, then it turns out that the maximum likelihood estimation of the model produces the result that the factor loading matrix should be a matrix of eigenvectors of the covariance matrix of y, stored column-wise.
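This link can be verified numerically; in the sketch below, which assumes numpy and scikit-learn and uses toy data, the eigenvectors of the sample covariance of y match the PCA directions up to sign and ordering:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
Y = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

# Eigenvectors of the sample covariance of y ...
evals, evecs = np.linalg.eigh(np.cov(Y.T))

# ... coincide (up to sign and ordering) with the PCA directions
pca = PCA(n_components=5).fit(Y)
print(np.round(np.abs(pca.components_), 3))       # rows: principal directions
print(np.round(np.abs(evecs[:, ::-1].T), 3))      # eigenvectors, largest eigenvalue first
```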
Factor analysis model estimation can be done using the EM algorithm.
In the E-step, it estimates the hidden factors x while keeping the parameters fixed from the previous step.
In the M-step, it adjusts the model parameters by maximizing the lower bound on the log likelihood.
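In practice you rarely need to code this by hand; for instance, scikit-learn provides a FactorAnalysis estimator that performs this maximum likelihood fit iteratively. A minimal usage sketch, with purely illustrative data and dimensions, might look like this:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
T, N, K = 2000, 10, 3
Lambda_true = rng.normal(size=(N, K))
Y = rng.normal(size=(T, K)) @ Lambda_true.T + rng.normal(scale=0.3, size=(T, N))

fa = FactorAnalysis(n_components=K).fit(Y)   # iterative maximum likelihood fit

print(fa.components_.shape)    # (K, N): estimated loadings (transposed convention)
print(fa.noise_variance_[:3])  # estimated diagonal of psi
X_hat = fa.transform(Y)        # inference of the hidden factors given the data
```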
Another latent variable model that I
wanted to briefly discuss here is called Probabilistic PCA.
Probabilistic PCA is a special case of factor analysis when we
take the noise covariance matrix to be proportional to a unit matrix.
Such noise is called isotropic.
Probabilistic PCA provides probabilistic generalization
of the conventional PCA exactly as its name suggests.
It can come in very handy in many different, practically important situations.
For example, when your data has missing values, or when only certain components of the data vector are missing on some days.
Both these situations are very commonly encountered in finance.
Probabilistic PCA can work with such incomplete data
or it can be used to fill in missing values in data.
Also, because it's probabilistic,
it has far fewer issues with outliers and noise in the data than the regular PCA.
Finally, the conventional PCA is recovered from probabilistic PCA if you take the limit of sigma going to zero.
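For reference, the maximum likelihood solution of probabilistic PCA has a closed form in terms of the eigendecomposition of the sample covariance (as derived by Tipping and Bishop); the short numpy sketch below uses toy data, and it also makes the limit explicit: as sigma squared goes to zero, the loadings reduce to the conventional PCA directions scaled by the square roots of the leading eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(5000, 6)) @ rng.normal(size=(6, 6))  # toy correlated data
k = 2                                                     # number of latent factors

# Eigendecomposition of the sample covariance, sorted in descending order
evals, evecs = np.linalg.eigh(np.cov(Y.T))
evals, evecs = evals[::-1], evecs[:, ::-1]

# Closed-form maximum likelihood solution of probabilistic PCA
sigma2_ml = evals[k:].mean()                                    # average discarded variance
W_ml = evecs[:, :k] @ np.diag(np.sqrt(evals[:k] - sigma2_ml))   # loadings (up to a rotation)

# As sigma2_ml -> 0, W_ml tends to the leading eigenvectors scaled by sqrt(eigenvalues),
# i.e. the conventional PCA solution.
print(np.round(W_ml, 3))
```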
Again, probabilistic PCA can be estimated using the EM algorithm for maximum likelihood estimation.
In the E-step, the algorithm makes inference of
the hidden state while keeping the model parameters fixed from the previous iteration.
In the M-step, it keeps the distribution over hidden nodes fixed and
optimizes over model parameters by maximizing the lower bound on the log likelihood.
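For completeness, here is a compact sketch of that EM loop for probabilistic PCA, assuming numpy; the updates follow the standard Tipping and Bishop derivation, and the function and variable names are just illustrative:

```python
import numpy as np

def ppca_em(Y, k, n_iter=100, seed=0):
    """Minimal EM for probabilistic PCA (isotropic noise). Y: (T, N) data, k: latent dim."""
    rng = np.random.default_rng(seed)
    T, N = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu                          # centred data
    W = rng.normal(size=(N, k))          # initial loading matrix
    sigma2 = 1.0                         # initial isotropic noise variance
    for _ in range(n_iter):
        # E-step: posterior moments of the latent factors given current parameters
        M = W.T @ W + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        Ex = Yc @ W @ Minv                         # (T, k) posterior means
        sumExx = T * sigma2 * Minv + Ex.T @ Ex     # sum over n of E[x_n x_n^T]
        # M-step: maximize the lower bound over W and sigma^2
        W = (Yc.T @ Ex) @ np.linalg.inv(sumExx)
        sigma2 = (np.sum(Yc ** 2)
                  - 2.0 * np.sum((Yc @ W) * Ex)
                  + np.trace(sumExx @ W.T @ W)) / (T * N)
    return mu, W, sigma2
```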
Okay. This concludes our quick excursion into latent variable models.
Let's quickly check what we learned and then move on.