In this video, we will discuss how to adapt linear methods for classification problems.

And let's start with simplest classification problem, binary classification.

Here we have only two values for the target: minus one and one;

negative class and positive class -- so, essentially,

linear model calculus dot product between w,

weight vector, and x, feature vector.

This dot product is a real value,

and we should somehow transform it to minus one or one,

and to do it, we can just take a sine of dot product.

So linear classifier looks like sine of w transposed by x.

It has deep parameters.

And if you remember,

we agreed that there is also a constant feature that has a value of one on every example.

So we don't have to explicitly include bias into our model.

The coefficient of this constant feature will be the bias.

So actually, maybe there are d+1 parameters for this model,

and geometrically it looks like that.

Suppose that we have two features,

so x's on this graph correspond to our features,

and we denote negative class by red points and positive class by blue points.

And the linear model tries to find

some line that separates blue points from the red points.

And as we know from geometry,

the sine of the dot product indicates on which side of the line the point lies.

So, if we have a positive dot product,

then the point lies at the positive side of this line,

and if the product is negative,

then the point lies on the negative side of our line.

Okay.

Let's switch to multi-class classification problem with K classes 1- K. In this case,

we should use some more complicated techniques to build our classificator.

One of the most popular approaches is to build a separate classifier for each class.

So, for example, for the first class,

we'll have a linear model -- linear classifier -- that

separates points of the first class from all other points.

So essentially, we try to fit a model so that

points of the first class lie on the positive side of this line of this hyperplane,

and points from all other classes lie on the negative side of this hyperplane.

And the dot product of this model is essentially a score.

The higher the score,

the more the model is confident that this point lies in the first class.

Then we build such a model for every class,

and we have K linear models,

and each model calculates a score,

and then we assign our new example to the class that

has the largest score -- the class with higher confidence.

For example, if we have three classes,

and our score vector looks like 7-7.5 and 10,

then we assign our example to the third class,

because a third component of the score vector is the largest.

Okay. Now we have the model,

and we should somehow learn it.

So we need a loss function.

And let's start with the simplest loss function,

accuracy loss, and to define it,

we'll need some notation -- Iverson brackets.

They denote just basic square brackets,

and we write some logical statement inside the brackets.

If the statement is true,

then the value of brackets is one,

and if the statement is false,

then the value of brackets is zero.

So, now let's define accuracy metric.

Let's take an example xi,

find the prediction a of xi,

and compare it to the true value of class

yi and write the equality of predicted class and true class in Iverson brackets.

So, the value of the bracket will be one if we guessed the class correctly,

and then it will be zero if we are misclassifying these points.

And then we just average these brackets over

all our data points -- over all our training set.

So, essentially, accuracy is just a ratio of

correctly classifying points in our training set.

This metric is good and could be easily interpreted,

but it has two large disadvantages.

At first, we'll learn from our next videos that we need

a gradient to optimize our loss function effectively.

And accuracy doesn't have gradients with respect to model parameters.

So we cannot optimize it -- we cannot learn the model to accuracy score.

And also, this model doesn't take into

account the confidence of our model in this prediction.

So actually, we have a dot product of weight vector and feature vector w and x,

and the larger the score,

the more the model is confident in this prediction.

If this dot product has a positive sign and a large value,

then the model is confident.

But if the sign is positive,

but the value is close to zero,

then the model is inconfident.

And we want not only a model that makes

correct decisions -- that gets its classes -- but we want a confident model,

and it's known from machine learning that models with high confidence generalize better.

Okay. Accuracy doesn't fit,

so we need some other loss function.

Maybe we can use mean squared error.

Suppose that we have some example, xi,

and it belongs to the positive class,

to the class one,

and consider a squared loss on this example.

So we take dot product between w and x and compare it to one,

and take a square of this difference.

So, if our model predicts one,

then the guess is correct and the loss is zero.

If our model gives a prediction between zero and one,

then it's inconfident in its decision and we penalize it for low confidence.

If the model gives the value lower than zero,

then it misclassifies this point.

So we give it an even larger penalty.

That's okay, but if the model predicts a value larger than one, then we penalize it.

We penalize it for high confidence,

and that's not very good.

We should give small or zero loss for high-confidence decisions.

Okay. So we can just take one branch of

our squared loss and penalize for low confidence and for misclassification,

and give zero loss for high confidence.

Actually, there are many loss functions that look like this one,

and all of them lead to their own classification methods.

We'll discuss one of the most important-for-us methods, logistic regression.

And to talk about it, we should first find a way to

convert our scores from linear classifiers,

to probabilities, to distribution.

So, we have some vector of scores z,

which has components w transposed by x,

though these are scores for each of our classes.

Dot products can have any sign and have any magnitude.

So we cannot interpret them as probability distributions,

and we should somehow change it.

We'll do it in two steps.

At first step, we take first component of

our vector and take e to the degree of this component.

We do the same to the second component,

et cetera, to the last component.

So, after this step,

we have a vector e to the degree of z that has only positive coordinates.

So now we need only to normalize these components to get a distribution.

And to do it, we just sum all the components of

this e-to-z vector and divide each component by the sum.

And after that, we get a vector sigma of z that is

normalized and has only non-negative components.

So we can interpret it as a probability distribution.

This transform is called a softmax function -- a softmax transform.

Consider an example with three classes.

We score 7-7.5 and 10.

If we apply softmax transform to this vector,

then we get the vector sigma of z with components 0.05, zero, and 0.95.

So the first component was largest before the transform,

and it has the largest probability after softmax transform.

Okay. Now we have an approach to transform our scores to probabilities.

This is the predicted probabilities of classes.

And now we need some target vector,

the vector that we want our probabilities to be equal to.

Of course, we want the probability of the true class to be one,

and probabilities of all other classes to be equal to zero.

So,we'd form a vector, b.

It's a target vector that is

just a binary vector of the size K where K is number of classes,

and it has one in the component that

corresponds to the true class of the current example,

and zeros in all other coordinates.

Now, we have target vector b,

vector of predicted class probabilities sigma of z,

and we should somehow measure the distance between these probability distributions.

To do it, we can use cross entropy.

Essentially, cross entropy is

just a minus log of the predicted class probability for the true class.

And also, we can write it as a minus sum of the indicator that our class y equals

to K multiplied by log of the predicted class probability for

the class K. Let's look at some examples of cross entropy.

Suppose that we have three classes,

and our example belongs to the first class.

So, y equals to one.

Suppose that we have some model that predicts probability of one to the first class,

and zero probabilities to the second and third classes.

So, this model makes a correct guess,

and the cross entropy is zero,

because it's a perfect model for us.

If we have a model that makes a prediction of 0.5 for

the first class and 0.25 for two other classes,

then the cross entropy equals approximately to 0.7.

So there is some loss here.

But if I have a model that assigns the probability of one to

the second class and zero probability to the first class,

then the cross entropy equals to plus infinity,

because I multiply one by the logarithm of zero.

So, cross entropy gives

a very high penalty for models that are confident in wrong decisions.

Okay. Now we can just sum cross entropies over all examples from our training set,

and that would be our loss function.

It's quite complicated and we cannot find analytical solution for this problem.

So, we need some numerical methods to optimize it,

and we'll discuss this method in the following videos.

So in this video, we discussed how to apply linear models to classification problems,

both through binary classification and multi-class classification,

and discussed how loss for classification problems should look like.

One of the most important methods for learning linear classifiers is logistic regression,

and we discussed how a loss looks in this case.

And in the next video, we'll talk about

gradient descent numerical method that optimizes any differentiable loss function.