Now, let's see how machine learning can

address the problem of predicting bank failures.

The model we want to try here is logistic regression, which we introduced in this lesson.

To remind you of the formula for logistic regression, here it is.

If X is a vector of all features

of a bank such as financial ratios or

macroeconomic variables and W are adjustable weights,

then the probability of a bank failure is the sigmoid function of their weighted sum.
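The formula referred to here can be written out as follows (a standard form of logistic regression, with the sigmoid applied to the weighted sum of the features):

```latex
P(\text{failure} \mid X) \;=\; \sigma\!\left(W^{\top} X\right) \;=\; \frac{1}{1 + e^{-W^{\top} X}}
```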

So let's discuss our dataset and

features that we want to use for logistic regression.

First, our data set includes 471 banks

that failed between 2001 and 2015.

In addition, it has

9,375 non-defaulted banks for the same time period.

So our data is very unbalanced.

We have far fewer positive examples than negative examples. I mean positive in our classification terminology, not positive from the perspective of the banks or the FDIC.

We can balance the data

by using what is called downsampling.

In doing this, we keep all records for

failed banks one year prior

to the failure and in addition,

keep roughly an equal number of records for non-failed banks.

So let's keep 500 random records for non-failed banks. As for the dates of these records, they can also be sampled randomly among the dates corresponding to one year prior to failure for the failed banks.

As a result, we have a balanced downsized dataset of

about 1,000 records for the failed and non-failed banks.
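The downsampling step described above can be sketched as follows; the record IDs here are hypothetical stand-ins for the actual bank records, with the counts taken from the lecture (471 failed, 9,375 non-failed, 500 sampled):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical record IDs: 471 failed banks, 9,375 non-failed banks
failed_ids = np.arange(471)
non_failed_ids = np.arange(471, 471 + 9375)

# Keep all failed-bank records (taken one year prior to failure), and
# downsample the non-failed banks to roughly the same number.
sampled_non_failed = rng.choice(non_failed_ids, size=500, replace=False)

balanced_ids = np.concatenate([failed_ids, sampled_non_failed])
print(len(balanced_ids))  # 971 records, roughly balanced
```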

Now, let's talk about

the features we use for this problem.

The data set that we have contains a number of

financial ratios such as net income to total assets,

non-performing loans to total loans,

logarithm of total assets and so on,

as well as some macroeconomic factors

such as the GDP growth,

stock market growth and so on.

All these predictors can be used in the present problem.

Though it turns out that some of

them are very important while

others have a low predictive power

and can therefore be skipped all together.

Finally, we have to make a test dataset.

This can be done by randomly splitting

our dataset into the train and test data sets.

In the experiments that I will show you next,

I had 310 failed banks in

a train dataset and 161 failed banks in the test dataset.

Now, before looking at the results of

such logistic regression model for bank failures,

let's just take a look at the data itself.

In these graphs, I show you scatter plots of

various financial ratios for

different failed and non-failed banks.

Each point on the graph has two coordinates.

The x coordinate is the logarithm of

total assets for the bank.

The y coordinate is a particular financial ratio.

Failed banks are painted red

while non-failed banks are shown in green.

As you can see here,

the red points are nearly linearly separable

from the green points except for a couple of outliers.

These pictures on their own,

should make us quite optimistic about

the results that we expect from

logistic regression for this problem.

This appears to be very clean data, which is a relatively rare case in finance.

In accordance with these expectations based

purely on visualization of the data,

we find that logistic regression

works very well for this problem.
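A minimal, self-contained illustration of fitting logistic regression by gradient descent on the log-loss is sketched below; the nearly linearly separable synthetic data is a stand-in for the bank features described earlier, not the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, nearly linearly separable data standing in for bank features
n = 200
X = np.concatenate([rng.normal(-1.0, 0.5, (n, 2)),   # non-failed banks
                    rng.normal(+1.0, 0.5, (n, 2))])  # failed banks
y = np.concatenate([np.zeros(n), np.ones(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit weights W and bias b by gradient descent on the log-loss
w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(round(accuracy, 2))  # near-perfect on separable data
```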

The graph on the left-hand side shows you

the so-called ROC curve for this problem.

We have not talked about metrics such as the ROC curve and the related measure called the area under the curve, or AUC, as we will cover them in more detail in our course on supervised learning.

But qualitatively, the steeper the curve rises on this graph, the better.
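One way to build intuition for AUC: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A sketch with hypothetical scores and labels (not the actual model's output):

```python
import numpy as np

# Hypothetical model scores and true labels (1 = failed, 0 = non-failed)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   1,   0,   1,   0,   0,   0])

pos = scores[labels == 1]
neg = scores[labels == 0]
# Pairwise-comparison estimate of AUC (ties count as 0.5)
auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
print(auc)  # 0.9375: 15 of 16 positive/negative pairs are ranked correctly
```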

The accuracy score, which we did explain in this course, is 96 percent, which is an excellent result for models of this sort.

The graph on the right shows you the decrease of

the test error obtained with

the TensorFlow implementation of

logistic regression for this problem.

Finally, the graph on

the bottom has to do

with the problem of feature selection.

There are multiple ways to select

the most predictive features

for a given machine learning problem.

One of the simplest ways is to look at

the p-values of

different predictors in logistic regression.

The graph on the bottom illustrates another approach to feature selection that is based on the use of an algorithm called random forest.

This algorithm which we will discuss

in our course on supervised learning,

provides an alternative model

for predicting bank failures.

It turns out that it works as well as

logistic regression for this particular problem.

But in addition to providing

an alternative predictive model for the same problem,

random forests can also be used to

find the most important features in our problem.

Each feature is represented by a bar on this diagram,

and the height of each bar indicates the importance of that feature for the problem.

As this diagram suggests,

there are only a few features among all features that are

present in our dataset which are really important.
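The feature-importance ranking described above can be sketched with scikit-learn's `RandomForestClassifier`, whose fitted `feature_importances_` attribute gives one bar per feature; the two-feature synthetic data here (one informative, one pure noise) is an assumption for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: feature 0 is predictive of the label, feature 1 is noise
n = 400
y = rng.integers(0, 2, n)
X = np.column_stack([y + rng.normal(0, 0.3, n),  # informative feature
                     rng.normal(0, 1.0, n)])     # pure noise

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # the informative feature dominates
```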

I'm not showing you which features

are the most important ones.

Finding this will be part of your homework for this week, where you will analyze the problem of bank failures among other assignments.

So bank failure prediction was our first use case for classification methods in finance.

There are also many other financial applications

for probabilistic classification models.

For example, predicting consumer defaults on credit cards or mortgages can be done using the same methods.

In trading, some tasks are commonly

formulated as classification problems as well.

For example, for value investing, which we discussed earlier, all stocks can be classified as undervalued or not undervalued. When such a classification is done, you can use it to construct an investment portfolio by buying the most undervalued stocks and selling the most overpriced stocks.

In your homework for this week, you will develop your practical skills in TensorFlow by working with neural network regression and classification models using equity fundamentals data and bank report data.

The Jupyter Notebooks that you will be working on in these assignments will be based on the notebooks that we used in our demos.

So this was our very busy Week 2, which was devoted to supervised learning and its uses in finance.

In the next week,

we will talk about unsupervised learning.

Good luck with your homework,

and see you next week.