So next, let's talk about some healthcare applications of RNNs. The first one is using a recurrent neural network for predicting heart failure onset. This is a paper published in JAMIA in 2017 on heart failure prediction with RNNs. The idea is: if you observe a sequence of visits from a patient's records, can you predict what is going to happen to this patient in the future with respect to heart failure? In a previous lecture, we talked about how to set up a predictive model: you extract features from the observation window, then skip some time window, which is considered the prediction window, and then the target event, in this case heart failure, either happens or does not happen. Compared to a standard classification model, the main difference here is that we explicitly use the sequential information from all those multiple visits in the observation window to construct the prediction. The input is a sequence of visits. For example, we have three visits here, and every visit is encoded as a multi-hot vector: if an event is present, the corresponding dimension is one, and if an event is absent, the corresponding dimension is zero. For example, here four types of events are present, so the corresponding four dimensions are one and the rest are all zeros. That is the multi-hot vector we use for each visit. Then we can construct this simple RNN structure. Each input is the multi-hot vector at visit 1, visit 2, and so on, up through visit t-1 and visit t. Each one of them maps to a hidden state h_t. The hidden state h_t is determined by h_{t-1} and the corresponding current input x_t. Finally, once we have h_t, the hidden state at the last timestamp, we can use it as a feature vector and train a logistic regression to predict whether heart failure will be present or absent for this patient.
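To make the multi-hot encoding and the recurrence concrete, here is a minimal NumPy sketch. The vocabulary of codes and the weights are illustrative, not from the paper, and a vanilla tanh cell stands in for the GRU used in the study; the structure is the same, though: h_t is computed from h_{t-1} and x_t, and the last hidden state feeds a logistic-regression output.

```python
import numpy as np

# Hypothetical vocabulary of medical event codes (assumption for illustration).
vocab = {"dx:hypertension": 0, "dx:diabetes": 1, "rx:metformin": 2,
         "proc:ecg": 3, "dx:obesity": 4}

def multi_hot(codes, vocab_size):
    """Encode one visit as a multi-hot vector: 1 where a code is present."""
    v = np.zeros(vocab_size)
    for c in codes:
        v[vocab[c]] = 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_predict(visits, Wx, Wh, b, w_out, b_out):
    """Vanilla RNN over visit vectors; logistic regression on the last h_t."""
    h = np.zeros(Wh.shape[0])
    for x in visits:
        h = np.tanh(Wx @ x + Wh @ h + b)   # h_t from h_{t-1} and x_t
    return sigmoid(w_out @ h + b_out)       # P(heart failure) for this patient

rng = np.random.default_rng(0)
d, V = 8, len(vocab)                        # hidden size, vocabulary size
Wx = rng.normal(0, 0.1, (d, V))
Wh = rng.normal(0, 0.1, (d, d))
b = np.zeros(d)
w_out, b_out = rng.normal(0, 0.1, d), 0.0

visits = [multi_hot(["dx:hypertension"], V),
          multi_hot(["dx:hypertension", "dx:diabetes",
                     "rx:metformin", "proc:ecg"], V)]
p = rnn_predict(visits, Wx, Wh, b, w_out, b_out)  # a probability in (0, 1)
```

In practice the weights would be learned end-to-end; this sketch only shows the forward pass described above.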
Here is the data we are using. It comes from Sutter Health, from a particular study of heart failure, and consists of 34,000 patients: 4,000 heart failure cases and 30,000 controls who do not have heart failure. The observation window we are using is 18 months. The case-control selection criteria are: the patients are aged between 40 and 85, we look at particular types of diagnoses, and there are some requirements on the number of visits and the time span between the diagnoses; these details are in the paper. If you take this dataset and apply an RNN, in this particular case a GRU as the RNN model, you can see it performs better than more traditional classification methods such as logistic regression, support vector machines, multilayer perceptrons, and K-nearest neighbors. For the RNN, we take into account the sequential aspect, visit by visit, while for the traditional methods the features are just aggregated across all of it, meaning you take the multi-hot vector for each visit, sum them up across all visits for this patient, and use that as the input feature vector for the other baselines. You can see that the GRU model performed better than those baselines. There are two variations of the setting. In one, they also encode the duration between visits. If visit 1 and visit 2 are three months apart, you just put that three-month duration as one extra dimension in the input feature, along with the multi-hot vector, and pass that to your GRU model. You can see that adding the visit duration is slightly better, but not by much. That is the first application of RNNs. The second one is called Doctor AI: predicting clinical events via recurrent neural networks. This is very similar to the previous one. It was published at the Machine Learning for Healthcare conference in 2016.
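The duration variant described above just widens each visit's input vector by one dimension. A tiny sketch of that feature construction (the helper name is made up for illustration):

```python
import numpy as np

def visit_with_duration(multi_hot_vec, months_since_prev):
    """Append the inter-visit gap (in months) as one extra input dimension,
    alongside the multi-hot event vector for this visit."""
    return np.concatenate([multi_hot_vec, [float(months_since_prev)]])

x = np.array([1.0, 0.0, 1.0, 0.0])   # multi-hot vector for one visit
aug = visit_with_duration(x, 3)      # visit came 3 months after the previous one
```

The augmented vector `aug` is what would be fed to the GRU in place of the plain multi-hot vector.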
The setting is that we want to do disease progression modeling, or sequential disease prediction. The setup is: given the past visits, for example visits 1 and 2 as the historical visits, you want to predict what is going to happen in the next visit, visit 3. In this case the target is not just one particular disease but any of those conditions: we want to predict whether a fever will happen, whether an X-ray will happen, and so on, using the same model. We essentially use the same RNN model: the input is these multi-hot vectors, which map to a hidden state h_t, and then from h_t we pass through a softmax layer to make this multi-class classification, predicting what is going to happen at each timestep. In this case, we are using a larger dataset from Sutter Health with 260,000 patients. This dataset was captured across 10 years and consists of diagnosis codes, medication codes, and procedure codes, over 38,000 different binary codes. The output labels are 1,183 diagnosis categories. Here is the performance on this task. When we compare the methods in this bar chart, the longer the bar the better, and the RNN performs the best. We compared it with simple logistic regression and some heuristics, like just using the most frequent codes from the history of the patient as your prediction, which actually performed reasonably well. The "last visit" baseline uses whatever happened in the last visit as its prediction for what is going to happen again. The metric we are using here is top-k recall: take the number of true positives among the top-k predictions, then divide by the total number of true positives. For example, if a patient has 10 diagnosis codes in the next visit and we make 10 predictions, and among those 10 predictions we predict 7 of them correctly, the top-10 recall will be 7 out of 10, which is 70 percent. That is essentially what we are getting with the RNN, a little bit above 70 percent, while the baselines are a lot lower.
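The top-k recall metric just described can be computed in a few lines. Here is a minimal sketch (the toy scores are made up; in the paper, scores would come from the softmax layer over the 1,183 diagnosis categories):

```python
def top_k_recall(scores, true_codes, k):
    """Top-k recall = |top-k predicted codes that are true| / |true codes|."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = len(set(ranked[:k]) & set(true_codes))
    return hits / len(true_codes)

# Toy example: 5 candidate codes, codes 0 and 3 truly occur in the next visit.
scores = [0.9, 0.1, 0.8, 0.3, 0.2]
r = top_k_recall(scores, {0, 3}, k=2)  # code 0 is in the top 2, code 3 is not
```

With these toy scores the top-2 predictions are codes 0 and 2, so only one of the two true codes is recovered and the top-2 recall is 0.5, matching the 7-out-of-10 style of calculation in the example above.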
Another interesting experiment we did is to see how generalizable those models are across datasets from different hospitals. Here, the two curves correspond to two different strategies for training the model. The red one uses random initialization of the model, what we call a cold start. The blue curve performs better; that is a warm start, where the parameters are initialized with a model learned from another dataset. In this case, we use the Sutter dataset for the initialization and then train on another dataset. You can see that the warm start, or pretraining, actually performs better: initializing the model this way leads to better performance.
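The cold-start versus warm-start distinction comes down to how the parameters are initialized before training on the target hospital's data. A minimal sketch of that idea, assuming a simple parameter dictionary (the function and shapes are illustrative, not from the paper):

```python
import numpy as np

def init_params(vocab_size, hidden, rng, pretrained=None):
    """Cold start: random initialization.
    Warm start: copy parameters learned from another dataset."""
    if pretrained is not None:
        return {k: v.copy() for k, v in pretrained.items()}
    return {"Wx": rng.normal(0, 0.1, (hidden, vocab_size)),
            "Wh": rng.normal(0, 0.1, (hidden, hidden))}

rng = np.random.default_rng(0)
cold = init_params(4, 3, rng)                         # random (cold start)
source_model = {"Wx": np.ones((3, 4)), "Wh": np.eye(3)}  # learned elsewhere
warm = init_params(4, 3, rng, pretrained=source_model)   # warm start
```

After this initialization step, both variants would be trained identically on the target dataset; only the starting point differs, which is what the two curves compare.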