Hello everyone. Today we're going to continue our discussion of data mining projects, and in particular we'll focus on some examples. At this stage, our main focus is identifying the key components of a real-world data mining project, and also starting to think about how you might propose your own project. So what I'm going to do today is give you an overview of some of the projects I have worked on in my research. Do keep in mind that these are research-oriented projects, so you don't need to worry about the technical details or the depth. But hopefully they will give you a reasonable sense of how data mining can be used in some very different scenarios. They are really just meant to give you some general ideas and help you see how to come up with your own projects. Keep in mind that your course project will be very different from the research projects I'm going to cover today.
As a starting point, my general research is about full-stack data analytics. What that really means is that I try to integrate the systems, algorithms, and application perspectives when I work on data analytics problems. Over the years I have worked on many different projects, and the applications are very different, but I can roughly put them into two categories. One is data-driven scientific discovery, and the other is data-driven ubiquitous computing. The latter usually refers to computation related to people's daily lives. Of course, we're talking about data analytics, so you have to start with the data. Always think about what kind of data you have and how you want to use the different types of data. When you have data, there are usually two related perspectives. One is about data management.
That is, I know what kind of data I need to get, and this is how I'm going to manage it so that I can support various types of data exploration on top of that. In this process, as we have covered in previous lectures, always think about both the efficiency and the effectiveness of your solution. Of course you want a solution that is effective, meaning that you're finding good patterns, you have accurate models, and so on. But efficiency is also important, because nowadays most data mining projects deal with large data, so you really need to think about whether your solution is efficient. This usually calls for innovations in system design and algorithm design. You may have systems that can efficiently compute various kinds of patterns or do various kinds of analysis or modeling, but your algorithm design itself can of course also contribute significantly to both the efficiency and the effectiveness of your solution. And then, with all that, on the application side I have these two main categories. As you will see in my examples, almost all the projects are interdisciplinary, meaning that you are trying to leverage some kind of domain information in the process. My research is more on the data analytics and data science side, but you try to leverage whichever domain you are in, so that the domain information can help you come up with better solutions.
So let's look at two concrete examples of what I mean by full-stack data analytics. The first one is plug-in hybrid electric vehicles. As some of you may know, electric vehicles are vehicles powered by a battery system. With a plug-in hybrid, the general idea is that you have both an internal combustion engine, just like a regular fuel-powered vehicle, and a battery system.
The two jointly power your vehicle. And because it's a plug-in, the battery system can also be recharged. So you have some pretty complex operational scenarios. A big part of the question is about being able to classify. You can see this as a classification problem because there are different modes for the vehicle to operate in. You may have just the internal combustion engine powering the vehicle, or just the battery system, or both of them together, or you could be charging the battery system. These are very different states, and those are the classes we're trying to determine. Once you know the mode, you can compute various information about the performance of your vehicle. Specifically, in this case, we start of course with data. Where do we get the data? In terms of ground truth data, there is an interface called on-board diagnostics (OBD). You can connect to this interface on your vehicle and read out a lot of detailed internal information about how your vehicle is operating. So that's the ground truth. It's very helpful in the modeling, but of course at run time you can't have everybody plug in a device to read the vehicle all the time. So instead we designed a system that uses smartphones. As you can see here, your smartphone may be mounted on the windshield, or you can actually just leave it anywhere in your car. The smartphone is able to capture various motion information, and using that we can build the classification model we're talking about. We take the phone's information and classify which mode of operation the vehicle is in at any given time. With that, you can then do various kinds of modeling, for example computing the fuel consumption.
Because if you know how the vehicle is operating, and you integrate some physical models, then you can estimate the fuel consumption. This also shows how different drivers can have very different fuel efficiency and CO2 emissions depending on how they drive. Related to that, again by knowing the mode of operation, you can estimate the remaining capacity of your battery system. That is very important from the economic angle, because if you're buying an electric vehicle, you want to know how well your battery system will maintain its capacity over a period of years. Here we are able to estimate that over a period of 15 years, depending on how you drive. We're showing that among eight different drivers, you already see quite a bit of difference depending on how they drive. Those estimates are again based on the core estimation of the operation modes. In addition, we also have a larger-scale study. This is a nationwide study in the United States, where we had data from 400-plus vehicles driven by volunteers. Each of them simply has a vehicle that they drive all the time, and we collect detailed information about how the vehicles are driven, and also about when they are charged. This gives us a very useful perspective: for those 400 or so vehicles, we have detailed recordings of their driving behavior, their running time, and when and at which location the vehicle is charged.
As another example: the electric vehicles are more on the scientific discovery side, but the other one is more on the ubiquitous computing side, so let's look at the user aspect. Here we're talking about group event scheduling.
I don't know how many of you have had this kind of scenario, where a group of friends or colleagues or family members wants to go out and do something together. There's always a benefit in being able to schedule your events. What you can do, of course, is either have somebody make a decision and that's it, or have people vote on different choices. So we developed a mobile app for groups of users. As a group, you can vote on the place and time you want to go, and people can upvote or downvote the different choices. That just makes it easier for a group to make decisions. But then where is the data mining in this process? Ultimately we would like to make recommendations. Based on whatever data we have observed, we would like to recommend a venue or time that is likely to work for the whole group. But what kind of information do you use? This again gets to the feature aspect, because the mobile app allows us to collect a lot of information. For example, you have the individual group members' location traces, so you know where they generally go. That can give us some idea about location clusters. So this is a clustering step: you take the location traces and try to identify location clusters where your group members are more likely to go. When you have that information, you also have a notion of location familiarity. For example, this is my home area, that is my work area, and then there's a shopping area I go to regularly. So individual users may have different levels of familiarity with those clusters.
You can also collect general information about people's preferences, such as whether you drive, whether you go out regularly or usually don't go very far, that kind of thing. And then you can look at the similarity between people, because it's not just that you have five people in your group and you take the average of the group members' preferences. Instead you want to look at the similarity across the members within the group. For example, if you look at the location traces, you may be able to say that certain members tend to be similar in terms of the times and areas they visit. That again gives you a better sense of how likely those members are to agree on a particular location and a particular time period. So that is the high-level idea. You're trying to make a recommendation, or in a way predict how likely the group is to like a particular venue and time period. But you're leveraging different types of features by constructing clusters and by looking at the similarities or correlations across those attributes.
So, as I said, let's look at full-stack data analytics in general. At the systems layer I have worked on many different problems, but generally I leverage mobile, wearable, and IoT sensing to collect the data, and then on top of that analyze the data for various kinds of application scenarios. From the algorithms side, a lot of the data I work with is spatiotemporal data, but you also try to augment it with many other data modalities. So one main focus from the algorithms angle is multimodal data fusion. How do you take very different types of data and fuse them together so that you really get the joint benefit of those different data modalities?
Anomaly detection is also a main focus of my research, and related to that, recommendation. There are many applications of data mining. As I have said, data mining is very useful in many, many domains, and in my research I have worked with quite a few of them. From the scientific side, a lot is about environmental data. You look at air quality, and more generally at remote sensing data. Seismology is about earthquakes, which I'll touch upon later. Oceanography is about sensing data in the ocean, where again you're trying to detect, say, fish schools or other species. Materials science is about designing new materials, and a big part of that is identifying relationships across the different properties of new materials, or how you compose different elements to build new materials. Renewable and sustainable energy: our electric vehicle project is one example, but there are also solar farms and wind farms, areas where you have a lot of data, most of it spatiotemporal, and a lot of it in very different modalities. And then transportation electrification, which generally refers to EVs, but it's not just the electric vehicles themselves. It's also the whole charging infrastructure, and it also gets to the traffic aspect and to the human side.
From the ubiquitous computing side, there are again many things you can do. Generally you're trying to understand people. You may want to detect specific activities that the users are performing, which is more like classification. You may also try to profile the behavior of individual users or groups, which again gives you patterns for modeling users' interests.
And then context-aware computing goes a little further: it's not only about the pattern, but about the contextualized pattern. That is, at a specific time, or at a specific location, or under other conditions, you have a particular pattern happening. This connects with frequent pattern analysis, and also a bit with association rules, because you're saying: given that this is happening, what's the likelihood of something else happening? Related to that is analyzing events, so it's not just users, but also how users behave under different events. Cyber safety refers more to detecting negative cases. The techniques are similar, in that you're still using a lot of data mining capabilities, but here you focus more on things that are not good. This could be fake news, this could be cyberbullying, or scenarios where some users or some information are different or suspicious in some way. A lot of that is related to online social networks, where you can get a lot of data, and there are some very interesting questions to be explored using various data mining approaches.
So let's look at a few concrete examples. As I have said, these are research projects, so you can ignore many of the technical details and really think about how data mining is being used in the different project settings. One thing I want to highlight, which is why I put this bridge between scientific discovery and ubiquitous computing, and something I want to push you to think about as well, is the connections between these two larger areas.
A lot of the time you see scientific discovery as just what scientists do, versus what individual users do. But look at my electric vehicle example. On the one hand, I'm trying to model the performance of the electric vehicles, but I also know that the performance is significantly impacted by how the users drive. So there are some very close connections between the two. Always be open-minded and look, from the application scenarios, at how things could be useful from different perspectives, and also at how different types of data may be tied together to answer specific questions.
First, let's look at what is more the systems layer. Here we are using very different types of data collection mechanisms: mobile, wearable, and IoT computing. You're sensing a lot of data, you're collecting a lot of data, and then on top of that you can do many different things. One example is air quality monitoring. In this case we designed our own device, which is fairly small. This device allows you to collect air pollutant concentrations wherever you go, because you basically carry it with you. Using this, we are able to identify, in a campus setting and also in a home setting, when and where you tend to see higher concentrations of particular types of air pollutants. You can also find correlations between different types of air pollutants. Another example is for running. What we're trying to do here is provide feedback to runners on their running form.
That is, how they're running, whether their position is good, whether the timing is right, and so on. There is some very detailed information we can extract by using motion sensors. You have the accelerometer information and the gyroscope information, with detailed timing, and then some analysis of the peaks and also the changes. You look at the general pattern, but you can also look for changes over time, because as you become exhausted, or depending on the weather conditions, there may be changes in your pattern as well. At a high level, from the data mining perspective, you're again trying to capture the different states of the runner's performance and then be able to see changes or anomalies in that process.
The next example is about exercising. Here we are using TX and RX, which are transmitters and receivers of Wi-Fi signals. When a user performs certain types of exercise, you see different changes in the Wi-Fi signals, and that is what we exploit. We are trying to classify what type of exercise the user is doing, for how long, and with how many repetitions. And if you want to go a little further, you can even look at changes, because each individual repetition may not be exactly the same as the previous ones. So you want to see not only the class or category of the exercise, but also what the expected pattern is and how the user's performance is deviating from it.
This next one is basically about tracking people. Again, I'm using Wi-Fi signals. Wi-Fi signals are, in a way, dependent on the movement in the area.
So if you have a user moving in this area, performing different kinds of activities, you can capture the different segments and then classify the types of activities. Think of it as a mechanism to monitor a user going about his or her daily life, knowing that now this user is cooking, or watching TV, or exercising, or something else. By collecting the location, time, and type of activity, you are able to help users have a much better understanding of how they go about their daily lives.
Another one, as you can see here, is about hand gesture recognition. Here we have designed a bracelet, which is of course attached to your wrist. When you perform different types of hand gestures, your muscles change differently, and that can be picked up by the sensors on the bracelet. The data mining part is then about how you take the sensor data collected in different scenarios, which is a time series across multiple sensors. It's a classification problem, because you have different types of hand gestures you're trying to recognize. You want to know which gesture the user just performed. So you want to use your multi-sensor information to classify the specific hand gesture. And here, if you're using this for any real-time application scenario, you of course need to think about efficiency. Not only do you want accurate classification, it also needs to be efficient, so that you can detect the specific gesture quickly rather than having the user wait a long time for the system to determine what it is. So that's one.
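To make the gesture classification idea concrete, here is a minimal sketch. It is not the method used in the project: it assumes each gesture is recorded as a short window of samples from a few sensors, summarizes each sensor by its mean and standard deviation, and assigns a new window to the nearest class centroid. The function names, the gesture labels, and the toy readings are all made up for illustration.

```python
from statistics import mean, pstdev

def window_features(window):
    """window: list of samples, each a tuple with one reading per sensor.
    Returns a flat feature vector: (mean, std) for every sensor channel."""
    feats = []
    for channel in zip(*window):  # regroup the samples by sensor
        feats.append(mean(channel))
        feats.append(pstdev(channel))
    return feats

def train_centroids(labeled_windows):
    """labeled_windows: list of (window, label). Returns {label: centroid vector}."""
    grouped = {}
    for window, label in labeled_windows:
        grouped.setdefault(label, []).append(window_features(window))
    return {label: [mean(col) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def classify(window, centroids):
    """Return the label whose centroid is closest to the window's features."""
    feats = window_features(window)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Toy training data: two sensors, four samples per window (hypothetical values).
fist = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.1), (0.9, 0.2)]
open_hand = [(0.1, 0.9), (0.2, 1.0), (0.1, 0.8), (0.2, 0.9)]
centroids = train_centroids([(fist, "fist"), (open_hand, "open")])

new_window = [(0.85, 0.15), (0.95, 0.1), (0.9, 0.2), (0.8, 0.1)]
prediction = classify(new_window, centroids)
print(prediction)
```

A nearest-centroid rule is cheap to evaluate, which fits the point about real-time use: the cost per decision is one pass over the window plus one distance per gesture class.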
This one I already talked about. It's about group events. Again, it's a mobile computing scenario that uses the mobile interaction data among group users to identify clusters of interest and then predict the likelihood of a particular venue being the preferred choice for a group.
Now let's look at one scenario. This is about air pollution sensing. As I have said, with air quality, of course, you build the sensors and read the air pollutant information. But here in particular it's about room-level classification, because air quality is usually related to which room you're in. Rooms are usually more like a natural enclosure for your air pollutants. So in this case we're trying to do indoor room localization: we want to know which room you're in. What we're using is Wi-Fi based. When you have your mobile phone, it can collect Wi-Fi signals as you go to different areas at different times. So we have the raw data: the Wi-Fi signal readings associated with different times. Indoors we cannot just use GPS signals, because usually they don't give good readings and they really don't have the granularity we need at the room level. So what we do is take the Wi-Fi signals and build a classifier. The first solution we have is supervised classification, because this is the case where I have the ground truth data: I have the Wi-Fi signals and the corresponding room number. So I know this is room one when you get those particular signals. When you have all that, you're basically trying to build a classifier.
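As a rough illustration of the supervised setting, the sketch below treats each Wi-Fi scan as a mapping from access point to signal strength and labels a new scan by a k-nearest-neighbor vote over labeled scans. The lecture does not say which classifier the project used, so k-NN is a stand-in, and the access point names, the signal values, and the `MISSING_RSSI` placeholder are assumptions.

```python
from collections import Counter

# Assumed placeholder strength (dBm) for an access point that a scan did not hear.
MISSING_RSSI = -100.0

def distance(scan_a, scan_b):
    """Euclidean distance between two scans given as {access_point: rssi}."""
    access_points = set(scan_a) | set(scan_b)
    return sum(
        (scan_a.get(ap, MISSING_RSSI) - scan_b.get(ap, MISSING_RSSI)) ** 2
        for ap in access_points
    ) ** 0.5

def predict_room(scan, training, k=3):
    """training: list of (scan, room). Majority vote among the k closest scans."""
    nearest = sorted(training, key=lambda item: distance(scan, item[0]))[:k]
    votes = Counter(room for _, room in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical ground truth: scans with the room they were collected in.
training = [
    ({"ap1": -40, "ap2": -80}, "room1"),
    ({"ap1": -45, "ap2": -75}, "room1"),
    ({"ap1": -42, "ap2": -78}, "room1"),
    ({"ap1": -85, "ap2": -38}, "room2"),
    ({"ap1": -80, "ap2": -42}, "room2"),
    ({"ap1": -82, "ap2": -40}, "room2"),
]

room = predict_room({"ap1": -43, "ap2": -77}, training)
print(room)
```

Raw signal strengths fluctuate, which is why the stable feature extraction and the room transition patterns discussed in the lecture matter in practice; this sketch uses the raw values only to keep the example short.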
There is some detailed design in terms of how we take the raw Wi-Fi signals and extract certain features that are stable, so that we get good classification results. We're also using the temporal angle. The idea is that we leverage the information that you don't jump arbitrarily from one room to another. Depending on how rooms are adjacent, and also depending on your daily schedule, you tend to have some frequent patterns in your transitions between rooms. That information, together with the raw Wi-Fi signals, allows us to achieve much better classification performance.
Later on, we built an unsupervised room localization method. Instead of having users provide the ground truth labels, telling me "I'm in this room and this is the Wi-Fi signal," we just collect Wi-Fi signals without any labels from the user. But there's again this nice clustering effect. Think about it: when the user is stationary in a certain area, you tend to see very similar Wi-Fi signals, while when they move around between places you tend to see very different signals. So that gives us the idea of clustering. I want to cluster the Wi-Fi signals: when the user is stationary I cluster his or her Wi-Fi signals, which should give me a rough set of clusters of potential locations of interest. And then I use the transition signals. When the user is moving, I have the starting point and maybe the end point. So I know the user is moving from A to B, because at the very beginning the signal is very similar to cluster A, and when the user stops moving the signal is very similar to cluster B.
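The unsupervised idea can be sketched in a few lines. The lecture does not name the clustering algorithm, so this example swaps in a simple threshold-based ("leader") clustering: a scan joins the first cluster whose center is within a chosen radius, otherwise it starts a new cluster. The radius, the two-access-point scans, and the function names are assumptions for illustration.

```python
def dist(a, b):
    """Euclidean distance between two scans given as equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def leader_cluster(scans, radius):
    """Assign each scan to the first cluster center within `radius`;
    otherwise open a new cluster centered on that scan."""
    centers, labels = [], []
    for scan in scans:
        for idx, center in enumerate(centers):
            if dist(scan, center) <= radius:
                labels.append(idx)
                break
        else:  # no existing center was close enough
            centers.append(scan)
            labels.append(len(centers) - 1)
    return centers, labels

def transitions(labels):
    """Turn a time-ordered label sequence into (from_cluster, to_cluster) moves."""
    return [(prev, cur) for prev, cur in zip(labels, labels[1:]) if prev != cur]

# Hypothetical stationary scans in time order: (rssi from ap1, rssi from ap2).
stationary_scans = [(-40, -80), (-42, -79), (-41, -81),
                    (-83, -39), (-82, -41), (-41, -80)]

centers, labels = leader_cluster(stationary_scans, radius=10.0)
moves = transitions(labels)
print(labels)  # [0, 0, 0, 1, 1, 0]
print(moves)   # [(0, 1), (1, 0)]
```

Here the clusters play the role of candidate rooms, and each move records that the user went from one cluster to another, which is the kind of transition evidence described above.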
Using the clusters and the transitions in this way allows us to almost automatically identify room-level information. Building on top of that, we were able to build floor plans. This particularly gets into sequence information. Think about it: I have signals that I can roughly assign to a room, but I don't know how those rooms are connected to each other. So again we leverage the transitions. If you're walking along the hallway, your signal will look similar to room 1 at the very beginning, then maybe room 2, room 3, room 4. So as you walk along the hallway, you have a natural way of linking those components together, and that ultimately allows us to construct a floor plan in a fully automated fashion.
Next, let's talk a little more about the algorithm part, one aspect of course being multimodal data fusion. As we have said, in many scenarios you should think about the different types of data you could leverage, because they usually have complementary capabilities. You want to integrate them so that you really get the joint benefit. One example is Meetup, if you know it. Meetup is an online social media platform that allows groups of people to schedule real-life events. Typically you have groups, and people can join different groups. The groups may fall into different categories: food-related, sports-related, travel, politics, and so on. So you have the different types of groups, and you can also get the users' information. Users usually have their own profile information. You may also have information about their previous attendance at other events, or their interactions with different groups.
On the other side, of course, you have the specific events. For any event, there is usually some kind of event description, there is of course the venue location, and then you have who may attend that event; these are the RSVPs. You can also have comments and other interactions related to specific events. As you can see, those types of data are very different, but they're all related. By combining that information, it becomes easier for us to make predictions about the likelihood of a particular user attending an event.
In a very different setting, this is again about air quality, but here we're trying to predict PM2.5, that is, particulate matter, a kind of pollution. You're basically trying to estimate the concentration levels. To do that, there are different types of information we can get. Of course, we have the air quality measurements, the pollutant concentrations. These are usually at an hourly level and they are sparse, because they depend on the monitoring stations. But then you can add in meteorological information. What is the weather like? What's the temperature? What are the wind speed and the wind direction? Those can have a very important impact on the concentration. You can also add in, for example, the transportation network, which is a big contributing factor when you're talking about air pollution. So you look at the transportation network and the traffic conditions, the volume of traffic along each segment. And then you can have, more generally, land use and population density. All of this is information related to a spatial area, and a lot of it is spatiotemporal.
We developed a method to fuse together all those types of information. We did a bit of grouping or clustering, because we're trying to find attributes that are more related; this is done through correlation analysis. And then we're using a deep neural network, in an encoder-decoder framework. We're also using LSTM, that is, long short-term memory. These are some of the technical details, which you can of course look into and explore further. But the high-level idea, from a data mining perspective, is about how you fuse together different types of data and then build your model, your neural network, to be able to make predictions about future concentrations, which are of course numerical values.
Another core category of algorithm design, which I have mentioned in previous lectures as well, is anomaly detection. As we have said, anomaly detection tries to look for things that are just different. They may be real events, or they could be errors or mistakes, or just something that's suspicious. In one scenario we're looking at remote sensing data; you have seen this figure before. The remote sensing data gives you information captured by certain types of sensors, or multiple sensors, and it is temporal, since it changes over time. But it's also spatial, because you know the corresponding spatial area where those particular values were collected. One big part of this is actually about the management. We talked about data warehousing; this is one good example where better data management is particularly useful. Think about how a satellite collects the data. It comes in slices: each slice is one big area, one snapshot.
When the satellite moves over a particular area, it takes one picture, or one set of sensor data, and when it comes back some time later it takes another snapshot. If you organize your data this way, in slices, each file is huge, and the files are separated by time. So if you want to analyze how particular areas are changing, it's very difficult. You have to go through this huge pile of files, and from each one get the one segment that corresponds to the area you're looking at, and then go through time. That's a very time-consuming process, and it's really not designed to support good temporal analysis. So we started by redesigning the management piece. We converted the raw data into what we call data rods. Each rod corresponds to a particular area and captures the whole time series for it. That allows us to very quickly look through time to identify changes. Remember, we're looking at general trends, but more importantly at the sudden changes, and those are the anomalies we're trying to capture. So this is really from the data management angle: think about what kind of analysis you want to do on top of the data. This is particularly useful when you have large amounts of data, because here we're talking about terabytes or petabytes. Changing the organization had a significant impact in terms of speeding up the data analysis process.
As a concrete usage scenario, we have satellite images over Greenland, so you have the Greenland ice sheet. The white areas are mostly snow and ice, and the gray or black dots correspond to lakes or meltwater.
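Going back to the data rod reorganization for a moment, the idea can be shown on a toy example. This is only a sketch under simplifying assumptions: each time slice is a small in-memory grid of values, and a rod is the time-ordered list of values for one grid cell. The real system works on files at terabyte to petabyte scale, and the function name and sample values here are made up.

```python
def slices_to_rods(slices):
    """slices: list of (timestamp, grid), where grid is a list of rows.
    Returns {(row, col): [(timestamp, value), ...]} sorted by time,
    so one lookup yields the full time series for one area."""
    rods = {}
    for timestamp, grid in sorted(slices, key=lambda s: s[0]):
        for r, row in enumerate(grid):
            for c, value in enumerate(row):
                rods.setdefault((r, c), []).append((timestamp, value))
    return rods

# Three hypothetical snapshots of a 2x2 area, arriving out of time order.
slices = [
    (2, [[0.9, 0.2], [0.8, 0.7]]),
    (1, [[0.9, 0.3], [0.8, 0.8]]),
    (3, [[0.9, 0.9], [0.8, 0.7]]),
]

rods = slices_to_rods(slices)
series = rods[(0, 1)]
print(series)  # [(1, 0.3), (2, 0.2), (3, 0.9)]
```

With the slice layout, reading the history of cell (0, 1) means opening every snapshot; with the rod layout it is a single lookup, which is why temporal analysis becomes much faster.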
The data allows us to track them. During the summertime, you'll see those black dots, the lakes, form and become bigger, and then they shrink back down when it gets cold again. That's a natural process. But one anomaly that scientists are trying to look for is the sudden disappearance of those lakes, which is a very important scientific phenomenon. You want to detect those sudden disappearances because they usually mean that there is some kind of cracking underneath, and that may be very useful for understanding how it changes the whole movement process of the ice sheet. From our data mining perspective, we have the original images. First I need to do detection, or classification if you want to call it that, on each individual image: is this snow and ice, or is this meltwater, a lake? One important thing there is that the data is fuzzy; the quality is not that good. So you really need good preprocessing so that you get reasonably good detection, separating the lakes from the snow and ice. The other piece that's very useful is the temporal part, because, as I said, we are looking for anomalies in time, for changes that happen quickly. So you want to take all those images, identify lakes in them, and also be able to track them. Tracking means establishing the correspondence, that this is the same lake over time. And then you check: do I still see the lake the next time the satellite comes back? This turns out to be a very interesting but also challenging problem from the data mining perspective, largely because it's a very large data set.
That's why efficient management is important. But it's also a really fuzzy data set. It's remote sensing data: you may have cloud coverage, and there may be days when you just don't have any information. So your algorithms really need to be robust in order to identify the changes, the sudden changes. Later on we generalized this approach for anomaly detection in this setting; I think we covered this briefly when we talked about anomaly detection on the techniques side. The high-level point again is that you have spatiotemporal data, and you may have different types of sensors, each attached to a particular space and time. Really think about the preprocessing part and make sure the data is of reasonably good quality. Then it gets to the contextual part of your model, because the anomalies are usually contextual anomalies. You need to look at what the context is, what the expected pattern is in that context, and how the observation differs from it. So it's the same notion: identify the context, then identify the contextual anomalies. All right, another problem setting is related to wind and solar farms, the renewable energy angle. Many of those wind or solar farms are pretty massive, and they also have a lot of sensor data to go with them, because they capture pretty detailed information about how their devices are operating. One example is a wind farm, where you get pretty detailed temperature readings from multiple sensors. From there, preprocessing: always do that. Check your data, identify potential issues, and clean up your data.
Once you have done the preprocessing, there are two steps from the data mining angle. The first is unsupervised, because here you don't have labels; you are basically just observing the sensor readings and trying to identify potential issues. So you use some kind of clustering approach to identify the bigger clusters. Those usually represent the more general patterns, when the farm is operating regularly or normally. There may be multiple such clusters, but those are the normal cases. On top of that, you can identify cases that are quite different from the others: ones that are very different and form only a small cluster. That is how you identify them as potential faults. The next step is supervised classification. The first step, as I said, is unsupervised clustering, so that you can identify cases that may be suspicious or faulty. The classification part goes one step further. Here you have a small amount of labeled data, showing good examples of maybe four different types of faults. So once you have identified that something is likely a fault, you want to be able to say which type of fault it is, because from the operational angle that is very useful. You don't want to just raise a flag saying there's something wrong. If you can be more specific, that there's something wrong and it is likely this particular problem, then from the maintenance or operational angle it's going to be very useful. So you can see here we are talking about data preprocessing in general, but also unsupervised clustering and supervised classification. Another example: solar farms.
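Before moving on to solar farms, here is a minimal Python sketch of the two-step wind farm idea described above: cluster first, treat small clusters as suspicious, then name the fault type using a few labeled examples. Everything in it is an assumption made for illustration. It uses a simple one-pass "leader" clustering and a 1-nearest-neighbor classifier as stand-ins, not the methods used in the actual project, and the readings, fault names, radius, and size threshold are all made up.

```python
import math

def leader_clusters(points, radius):
    """One-pass leader clustering: each point joins the first cluster whose
    leader is within `radius`, otherwise it starts a new cluster."""
    clusters = []  # each cluster is {"leader": point, "members": [indices]}
    for i, p in enumerate(points):
        for cl in clusters:
            if math.dist(p, cl["leader"]) <= radius:
                cl["members"].append(i)
                break
        else:
            clusters.append({"leader": p, "members": [i]})
    return clusters

def suspicious_indices(clusters, min_size):
    """Step 1 (unsupervised): readings in small clusters are potential faults."""
    return sorted(i for cl in clusters
                  if len(cl["members"]) < min_size
                  for i in cl["members"])

def nearest_label(point, labeled_examples):
    """Step 2 (supervised): label of the closest labeled fault example."""
    return min(labeled_examples, key=lambda ex: math.dist(point, ex[0]))[1]

# Hypothetical readings from two temperature sensors on one turbine.
readings = [
    (60, 61), (61, 60), (59, 60), (60, 59),   # normal operating mode A
    (70, 71), (71, 70), (70, 70), (69, 70),   # normal operating mode B
    (95, 60),                                 # unusual reading
    (60, 20),                                 # unusual reading
]
# A small labeled set; the fault type names are invented for this example.
labeled_faults = [((94, 62), "overheating"), ((58, 18), "sensor dropout")]

clusters = leader_clusters(readings, radius=5)
suspects = suspicious_indices(clusters, min_size=3)
diagnosis = {i: nearest_label(readings[i], labeled_faults) for i in suspects}
```

Note how the two normal modes each form their own large cluster, which matches the point that normal operation may itself consist of several clusters.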
Here, of course, you have the solar panels next to each other, and they can really spread out over a huge field. Similarly, you can look for faults, or for panels that just perform differently from the others. There's one approach we have proposed, which is more of a hierarchical design. The idea is that if you are next to each other, you should be performing similarly, because your conditions, your context, should be very similar in terms of the sunlight, the sun angle, and so on. So the expectation is that within a neighborhood, say panels that sit on the same PV string or the same combiner box, there's a certain level of similarity. When a panel deviates from that, we say this is likely something that is malfunctioning. That's the mechanism we can leverage to detect anomalies. There's one further step. So far you are just using local information: you say my local panels should be very similar to me. But you can also identify similar cases further away. This goes back to the contextual anomaly. Panels are similar not only locally; you can also look across the bigger solar farm, or even across solar farms, for similar conditions. Those panels may be far away, but when they are subject to similar conditions, we expect them to have similar performance. That is the high-level notion. By expanding across a larger area, looking for specific conditions that are similar, and then looking for the patterns you expect under those conditions, you have a more robust way of capturing potential faults or anomalies.
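The solar farm idea above, comparing each panel with peers that share its context, can be sketched as follows. This is a simplified illustration under stated assumptions, not the hierarchical method from the project: the field names (`string`, `irradiance`, `power`), the median-based comparison, the 20% tolerance, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def flag_contextual_anomalies(panels, context_of, tolerance=0.2, min_peers=3):
    """Flag panels whose power deviates from the median of their context group.

    `context_of` maps a panel to its context key, for example its PV string
    (local neighborhood) or a bucket of similar operating conditions.
    """
    groups = defaultdict(list)
    for panel in panels:
        groups[context_of(panel)].append(panel)

    flagged = []
    for members in groups.values():
        if len(members) < min_peers:
            continue  # too few peers to say what "expected" looks like
        expected = median(m["power"] for m in members)
        for m in members:
            if abs(m["power"] - expected) > tolerance * expected:
                flagged.append(m["id"])
    return sorted(flagged)

# Made-up readings: string S3 has only two panels, so it has no usable
# local neighborhood, but it sees the same sunlight as string S1.
panels = [
    {"id": "A1", "string": "S1", "irradiance": 1000, "power": 300},
    {"id": "A2", "string": "S1", "irradiance": 1010, "power": 295},
    {"id": "A3", "string": "S1", "irradiance": 990,  "power": 305},
    {"id": "A4", "string": "S1", "irradiance": 1005, "power": 180},
    {"id": "B1", "string": "S2", "irradiance": 500,  "power": 150},
    {"id": "B2", "string": "S2", "irradiance": 505,  "power": 148},
    {"id": "B3", "string": "S2", "irradiance": 498,  "power": 152},
    {"id": "C1", "string": "S3", "irradiance": 995,  "power": 298},
    {"id": "C2", "string": "S3", "irradiance": 1002, "power": 200},
]

# Local context: panels on the same string.
local_flags = flag_contextual_anomalies(panels, lambda p: p["string"])

# Wider context: panels anywhere with similar irradiance (buckets of 100).
condition_flags = flag_contextual_anomalies(
    panels, lambda p: round(p["irradiance"] / 100))
```

In this toy data the local comparison can only judge panels that have enough neighbors, while grouping by similar conditions also gives the small string a reference to be compared against, which is the robustness argument made above.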
All right, next I'll talk a little about work focused on online social media. Online social media has become very popular these days, and it is actually very popular among data scientists [LAUGH], because you can use a lot of that data and really learn a lot of things. One example here is about public response to earthquakes. In this project we worked with seismologists, who study earthquakes, and with risk communication specialists. They are more social scientists, and they look at how you communicate risk and better prepare people when some hazard happens, in this case earthquakes in particular. What we looked at is the 2019 Ridgecrest earthquakes in California. They are particularly unique because there were actually two major earthquakes: the magnitude 6.4 foreshock and the magnitude 7.1 mainshock. Actually there's a later one that's also pretty big. Because of that, there were a lot of posts on Twitter related to this earthquake sequence, and we wanted to study how people responded to it using the Twitter data. We collected a lot of information from Twitter, and we looked at several things. One is the key Twitter accounts: who are the key players? Are they mostly individual people, or are they more authoritative organizations, news media, or something else? We just want to understand that. Again, this is application driven: we are curious, and we want to understand the particular patterns related to the key players. We also want to understand people's emotions. How do people feel about this?
Are people mostly very negative, or is there also some positive response, and what are they positive or negative about? That gets you to the specific topics. We also looked at rumors, which is usually where misinformation comes in, and that may negatively impact the overall response. One thing you can see is that people definitely responded very quickly to the foreshock and the mainshock: there is a significant spike in the number of tweets posted right after each of those events. That's one observation. Now look at the emotions. This is sentiment analysis in general: you take the individual tweets and try to understand whether each is more negative or more positive, and there are different scales describing the positive and negative levels. Here, this is after the foreshock. Red and blue represent negative and positive sentiment. You see a mix, with a bit of up and down, and that's fine. But what's significant is this part, after the second one, the magnitude 7.1 mainshock. You can see a significant difference. Blue is positive and red is negative, and there is a significant increase in negative sentiment in people's tweets after the second earthquake. One can of course say this is expected, but looking at how people responded over time gives you a much stronger signal, and it also shows how things changed between the two earthquakes. The other thing we looked at is the rumors.
We want to understand in general what people are talking about and whether they are positive or negative about it, but also the specific topics they talk about, and in this case particularly the rumors. From the risk communication perspective, you want to understand what rumors or misinformation may be going around, and of course you want to come up with effective mechanisms to prevent or reduce it. So in this case we are trying to understand what kind of rumors people are talking about, and furthermore how they spread in the network: which accounts are most influential in spreading them, and what kind of users are interacting with those rumors. All of that is very useful information. As you can see, we are dealing with a lot of data, there is a lot of preprocessing, and there is modeling from different perspectives: the temporal angle, the topic angle, the sentiment angle. All of those are core, interesting data mining tasks that you can leverage. The end result is a much better understanding of the whole process, and then we can have more targeted communication strategies, given what we know about how people respond. All right, another one, again on online social media, but this one is about Reddit. As some of you may know, Reddit is also a very popular online social media platform, but it's different from Twitter. On Twitter you have tweets, which are much shorter, whereas Reddit tends to have longer posts, and the community is also quite different. It's less about immediate response to particular events, but people do talk about many, many different topics. And it's organized by communities.
In this case we were looking in particular at the NBA fan communities. If you go to Reddit, there is r/NBA. That's a general one; it's called a subreddit, but it's a community of people, and it captures a lot of the NBA fans. Then you have the team-specific communities. What we looked at first is, in general, what kind of activity do we see? We started by comparing different teams. If you're a top team versus a bottom team, the number of users and the number of posts will probably be different. Here you can see that the top teams had a lot more activity and more members, but the bottom teams also have pretty good activity, so that's good. It's the teams somewhere in between that are not as active. Then we looked at what they talk about: whether the fans talk more about the current season and a particular game, or are more future looking, as in how do we improve our team? Intuitively, and the study actually shows this, fans of bottom teams tend to focus more on how to improve the team and make it better, whereas fans of top teams are usually excited about winning the current game or the current season. There's also an interesting angle about retention. If you're talking about your fan community, and you're trying to maintain or grow it, you of course want to see how much retention you have: how many of the users stay from one season to the next, or from one month to the next. We can see that the top teams have a lot of activity and a lot of users, but they actually tend to have lower retention.
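The retention measure just described, the share of users from one period who are still active in the next, is simple enough to write down directly. The sketch below is only an illustration: the function name, the team and season labels, and the usernames are made up, and the numbers are not results from the study.

```python
def retention_rate(users_before, users_after):
    """Fraction of users active in the earlier period who are also active
    in the later period. Returns 0.0 if nobody was active earlier."""
    before = set(users_before)
    if not before:
        return 0.0
    return len(before & set(users_after)) / len(before)

# Hypothetical activity logs: community -> period -> usernames that posted.
activity = {
    "top_team": {
        "season_1": ["a", "b", "c", "d", "e", "f"],
        "season_2": ["a", "b", "g", "h"],
    },
    "bottom_team": {
        "season_1": ["u", "v", "w"],
        "season_2": ["u", "v", "x"],
    },
}

retention = {
    team: retention_rate(periods["season_1"], periods["season_2"])
    for team, periods in activity.items()
}
```

In this made-up data the larger community keeps 2 of its 6 earlier users while the smaller one keeps 2 of 3, which is the kind of contrast between size and retention discussed here.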
The idea is that the fans of the bottom teams are more loyal; they tend to stay longer. The top teams may attract more bandwagon fans: people who follow because the team is winning, and once the season is done they go away, or go back to their original team. A more specific analysis you can do is about people moving from one team community to another. That means they are switching communities, so where do they go? Again, you have the raw information. You know exactly which members have a flair, because users can flag which team they support, and you know where they post. Those are the kinds of information you can leverage to understand how the fans of different teams are moving and how they are behaving. We then looked at this from the sentiment angle again. We wanted to see whether people behave similarly or differently when they are in their own group versus going to other groups. As we have said, this is a fan community, the NBA has many teams, and the teams compete; by nature they are competitive. This slide shows the users who only stay within their own team community; those are the blue bars. But you can see that a lot of them also go across communities, meaning they also post in other teams' communities or in the general NBA community. They don't just stay in their own team's community. Because of that, we then looked at whether they behave differently when they are talking within their group versus across groups. What we found is actually a pretty significant difference.
This is about their activity and also their use of negative words, and we are comparing the intergroup setting versus the intragroup setting. Some of the details you can ignore. But this analysis allows us to compare, and it shows that people tend to be much more negative when they go to other teams' communities. That, unfortunately, definitely causes some issues. So this again provides useful insights for better management, from the fan community angle and for sports management in general, but also from the online social media angle, so that you have a better understanding of user behavior. All right, so far I have given you many examples, and they are very different. Those are research projects, and I will emphasize again: don't feel overwhelmed if there are too many technical details or if a project seems too big. My intention is really just to give you, hopefully, a good set of samples of how data mining can be used in very different settings. As you look at those examples, think about the different aspects of data mining: which particular data mining tasks may be useful for some aspect of those projects? That is a way to help you identify the different components in real-world data mining projects. So see those as just examples; that's really what I intend them to be. As you look at them, keep doing your brainstorming. As we have said, this is the stage of project brainstorming. Think about your initial ideas, think about how feasible they may be, talk to people, and then iterate.
You probably have some ideas already, and after seeing today's lecture you may have a few more ideas to explore. That's good, because I think this is the fun part. You want to take your time to formulate your project rather than rushing: I want to make a decision now, and I'll just go ahead and write my code and do the evaluation. No, really, your work will be much more valuable when you have a good project design. So take your time and think about what interests you. Of course, keep in mind that there may be too many things you'd like to do. It's totally okay if you have too many choices, but ultimately, for the scope of our project in this course, you want to narrow down to something that is more manageable and still of interest to you. You also want to have a prioritized list of tasks that you'd like to explore, which will help you maintain the scope of your project. With that, I'd like to end today's lecture. Keep thinking about your project ideas, and we'll talk more about the actual proposal next time.