This module covers qualities of the data engineering solution beyond functionality and performance. It addresses reliability, policies, and security.

Let's begin with reliability. Reliable means that the service produces consistent outputs and operates as expected. If we were to quantify it, it would be a measure of how long the service performs its intended function. Available and durable are real-world values, and they're usually not 100 percent. Available means that the service is accessible on demand; it's a measure of the percentage of time that the item is in an operable state. Durable has to do with data loss. It means the data does not disappear and information is not lost over time. More accurately, it's a measure of the rate at which data is lost.

These qualities are related. If a service fails, if it has an outage, then it's not producing reliable results during that period. An alternate service or failover might bring the service back online and make it available again. Typically, an outage that causes a loss of data takes more time to recover from if the data has to be restored from backup or through a disaster recovery plan. But notice that if you have an alternate service, such as a copy that can be rapidly turned on, there might be little or no data loss and little time lost to recovery. So the important things to consider are: what are the business requirements to recover from different kinds of problems, and how much time is allowed for each kind of recovery? For example, a disaster recovery time of a week might be acceptable for flood damage to a storefront. On the other hand, loss of a financial transaction might be completely unacceptable, so the transaction itself needs to be atomic, backed up, and redundant.

If the solution is designed to be fault-tolerant, simply scaling up may improve reliability. In this example, if the service is running on one node and that node goes out, the service is 100 percent down. On the other hand, if the service has scaled up and is running on nine nodes and one goes out, the service is only 11 percent down.

The next section of the exam guide, performing quality control, is part of the reliability section. It refers to how you can monitor the quality of your solution. Integrated monitoring across services can simplify the activity of monitoring a solution. You can get graphs for multiple values in a single dashboard, and it's possible to surface application values as custom metrics in Stackdriver. These charts show slot utilization, slots available, and queries in flight for BigQuery over a one-hour period. The exam tip here is that you can monitor both infrastructure and data services with Stackdriver.

TensorBoard is a collection of visualization tools designed specifically for TensorFlow: you can use it to visualize the TensorFlow graph, plot quantitative metrics, and graph additional data passing through the model. The Events tab at the top left shows loss; the other panel shows the graph of the linear model as built by TensorFlow. The exam tip here is that service-specific monitoring may be available. TensorBoard is an example of monitoring tailored to TensorFlow.

Here are some tips for reliability with machine learning. There are a number of things you can do to improve reliability. For example, you can recognize machine failures, create checkpoint files, and recover from failures. You can also control how often evaluation occurs to make the overall process more efficient.
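To make the scaling arithmetic concrete, here is a minimal sketch; the node counts are just illustrative, and it assumes the nodes share load equally and fail independently.

    # Fraction of serving capacity lost when some of n identical nodes fail.
    def fraction_down(total_nodes, failed_nodes=1):
        return failed_nodes / total_nodes

    print(fraction_down(1))  # 1.0    -> one node of one: 100 percent down
    print(fraction_down(9))  # 0.111  -> one node of nine: about 11 percent down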
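On surfacing application values as custom metrics: the sketch below writes a single data point with the Cloud Monitoring Python client (Stackdriver is now called Cloud Monitoring). The project ID, metric name, and value are placeholders, and the exact object construction varies somewhat with the client library version.

    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    project_name = "projects/my-project-id"  # placeholder project

    # Describe the custom metric and the resource it is measured against.
    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/pipeline/records_processed"
    series.resource.type = "global"

    # Attach one point: the current value at the current time.
    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"int64_value": 42}})
    series.points = [point]

    client.create_time_series(name=project_name, time_series=[series])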
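As for TensorBoard, values like the loss curve in the Events tab come from summaries written during training. A minimal sketch, assuming the TensorFlow 2 summary API and a hypothetical log directory:

    import tensorflow as tf

    # Write scalar summaries that TensorBoard can display.
    writer = tf.summary.create_file_writer("logs/run1")
    with writer.as_default():
        for step in range(100):
            loss = 1.0 / (step + 1)  # stand-in for a real training loss
            tf.summary.scalar("loss", loss, step=step)

    # Then view the curves with: tensorboard --logdir logs/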
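Those last tips, checkpointing for failure recovery and controlling how often evaluation occurs, map directly to settings in the tf.estimator API. A minimal sketch follows; the model directory, feature column, and toy input functions are all hypothetical stand-ins.

    import tensorflow as tf

    # Hypothetical feature column and toy input functions, just to make
    # the sketch self-contained.
    feature_columns = [tf.feature_column.numeric_column("x")]

    def train_input_fn():
        return tf.data.Dataset.from_tensor_slices(
            ({"x": [1.0, 2.0, 3.0]}, [2.0, 4.0, 6.0])).repeat().batch(3)

    def eval_input_fn():
        return tf.data.Dataset.from_tensor_slices(
            ({"x": [4.0]}, [8.0])).batch(1)

    # Checkpoint every 1,000 steps; keep the five most recent checkpoints.
    run_config = tf.estimator.RunConfig(
        model_dir="gs://my-bucket/model",  # hypothetical bucket
        save_checkpoints_steps=1000,
        keep_checkpoint_max=5)

    estimator = tf.estimator.LinearRegressor(
        feature_columns=feature_columns, config=run_config)

    train_spec = tf.estimator.TrainSpec(
        input_fn=train_input_fn, max_steps=50000)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=eval_input_fn,
        throttle_secs=600)  # evaluate at most once every ten minutes

    # If the job fails and is restarted, training resumes from the latest
    # checkpoint found in model_dir rather than starting over.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)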
The tip shown is that in TensorFlow, data is often divided into training and evaluation sets, defining a path for measuring effectiveness and for improvement; a short sketch of such a split follows below. So the overall exam tip is that there might be quality or reliability processes built into the technology, as these examples demonstrate.
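Here is one minimal way to make that split with tf.data; the sizes and seed are arbitrary.

    import tensorflow as tf

    # Shuffle once with a fixed seed so the split is reproducible, then
    # hold out 20 percent of the records for evaluation.
    dataset = tf.data.Dataset.range(1000).shuffle(
        1000, seed=42, reshuffle_each_iteration=False)
    eval_dataset = dataset.take(200)
    train_dataset = dataset.skip(200)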