As I decide to continue using my relational database systems to handle big data, I'm going to have to figure out how I can do so by relying on techniques like replication, parallelization, and sharding. Replication really just means I'm making a copy of my database; it's almost like a mirror image, where I can take a full copy of my database from one place and create another copy, or replica, over on another place. We'll be talking about that. Parallelization just means that I'm taking my compute work and spreading it out over multiple nodes in a cluster. Of course, if I have a compute task like running a large query and I run it on one server by itself, it's going to take a certain amount of time. If I can spread that query work across multiple server nodes in a cluster, it's going to take less time and run faster. So that's a technique I can use to help me handle big data with my relational systems. And I can do sharding, which comes from the term shard, a common term in the world of archaeology and anthropology, where scientists study ancient civilizations by looking at shards of pottery. Can you imagine that? That's where the term comes from: if a clay pot has been broken into many small pieces, those are pottery shards, and scientists can study them to learn about ancient civilizations. Well, in this case we're going to take our database and break it into a lot of small pieces. The shards are often referred to as partitions, so I may use that term as we describe taking my data, breaking it into partitions, and spreading those partitions out across multiple nodes in the cluster. If I have done sharding and replication to spread my database out across many nodes in a cluster, I can run my compute work in parallel.
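The partitioning idea described above can be sketched in a few lines of Python. This is a minimal illustration, not any real database's implementation: the shard size, node names, and mapping scheme are all assumptions chosen to match the range-based example that follows.

```python
# Sketch of range-based sharding: rows are assigned to partitions
# (shards) by key range, and shards are spread across cluster nodes.
# SHARD_SIZE and NODES are illustrative values, not a real product's.

SHARD_SIZE = 100                  # rows per shard, by key range
NODES = ["node1", "node2", "node3", "node4"]

def shard_for_key(key: int) -> int:
    """Return the shard (partition) number for a row key."""
    return (key - 1) // SHARD_SIZE        # keys 1-100 -> shard 0, etc.

def node_for_key(key: int) -> str:
    """Map a row's shard onto a node, spreading shards round-robin."""
    return NODES[shard_for_key(key) % len(NODES)]
```

With this mapping, keys 1-100 land on node1, 101-200 on node2, and so on, which is exactly the kind of placement the diagram later in this lesson shows.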
When I'm using replication in certain database systems, MySQL for example, if I create a MySQL database and decide to do clustering, MySQL defines one server node with a special role called the watcher. So in this example, suppose I've got some application software that wants to do work against my database, which is spread out across multiple nodes in a cluster. That application software sends its database calls through the watcher, which is really like a traffic manager, a coordinator. The watcher then sends query requests to different nodes in the cluster, depending on what that query is trying to accomplish and where the data is. So take a look at this little diagram here: I've got an application that is talking to the database through a special traffic manager node called the watcher. The watcher remains aware of the health of the cluster and can send traffic to different nodes in the cluster. In this case I've replicated my primary database using the log files, creating a replica over here, which is a secondary copy of my database. If something happens to my primary database server, the watcher becomes aware of that and can send all query traffic to the secondary copy of the database, which means I have no downtime and no single point of failure. So that's the role of the watcher: it provides high availability and reduces single points of failure, allowing my replicated database, which is spread out across many nodes in the cluster, to stay up and running. Here's another example of a similar architecture, where I've implemented sharding to distribute data partitions across nodes in the cluster. In this case I've got a four-node cluster. Each node has its own partition of the data from my database. So I can send all of the rows in my database with key values from 1 to 100 to this copy of the database, and then another partition onto node number two, where I'm pushing rows that have key values from 101 to 200, and so on.
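The watcher's failover behavior described above can be sketched as a tiny health-aware router. This is a hedged illustration of the concept only; the class and method names are invented for this sketch and do not correspond to any real MySQL API.

```python
# Sketch of the "watcher" coordinator role: it tracks node health and
# redirects query traffic from the primary to a secondary replica when
# the primary fails. All names here are illustrative.

class Watcher:
    def __init__(self, primary: str, secondary: str):
        self.primary, self.secondary = primary, secondary
        self.healthy = {primary: True, secondary: True}

    def mark_down(self, node: str) -> None:
        """Record that a health check on this node has failed."""
        self.healthy[node] = False

    def route(self) -> str:
        """Send traffic to the primary while it is up, else fail over."""
        if self.healthy[self.primary]:
            return self.primary
        if self.healthy[self.secondary]:
            return self.secondary
        raise RuntimeError("no healthy node available")
```

As long as the primary is healthy, `route()` keeps returning it; the moment the watcher marks the primary down, traffic silently shifts to the secondary, which is the "no downtime" property the lesson describes.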
201 to 300 and 301 to 400. So my database could consist of one table that I have spread across four partitions in my cluster. Now, when I do that, the watcher needs to be smart enough to know what data is where and to send query requests to the right node in the cluster accordingly. And as I do this, it supports my goal of running things in parallel, so that I could have a query go out and run on each of these server nodes against a subset of the data, a partition, and return results faster. So that's an example of how sharding can help me continue to use my relational database systems and handle big data. I can combine sharding and replication so that, in this example, I've got node number one containing key values 1-100 and another replica of that data containing the same shard, or same partition, so that my database shard number one, or partition number one, exists on both Node 1 and Node 3. So I've got redundancy in my sharding, and what does that give me? It gives me even more parallelization and high availability. As a matter of fact, we did talk about database backups and how the database will use transaction logs to restore a database from a backup copy. If I use sharding and replication like this to spread my data out across many nodes in the cluster, I don't have to back up my database, because I've got redundant copies. So in this scenario, if I lose node number one, I don't have to do a database recovery, because I've got a full copy of that partition here on node number three. That reduces the need for me to do database backups, and it keeps my database up and running all the time: high availability. There are really two modes of replication I want to mention to you. What you just saw was an example of primary-to-secondary replication, which used to be called master-slave replication. The other is peer-to-peer. So I want to differentiate these for you just a little bit.
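The combination of sharding with replication described above can be sketched as a placement table plus a scatter-gather query. This is a simplified illustration under assumed names: the placement map, the two-replica scheme, and the COUNT-style query are all invented for the sketch.

```python
# Sketch of sharding combined with replication: each shard lives on two
# nodes, so losing one node still leaves a full copy of every shard,
# and a query can be scattered across shards and the results gathered.

PLACEMENT = {                      # shard -> its replica nodes (illustrative)
    1: ["node1", "node3"],
    2: ["node2", "node4"],
    3: ["node3", "node1"],
    4: ["node4", "node2"],
}

def live_replica(shard: int, down: set) -> str:
    """Pick any replica of the shard that is still up."""
    for node in PLACEMENT[shard]:
        if node not in down:
            return node
    raise RuntimeError("shard %d unavailable" % shard)

def scatter_gather(shard_data: dict, down: set) -> int:
    """Run a COUNT-style query on every shard's live replica and sum
    the partial results, as a parallel query would."""
    return sum(len(rows) for shard, rows in shard_data.items()
               if live_replica(shard, down))   # raises if a shard is lost
```

Notice that with node1 down, shard 1 is still served by node3, so no recovery from backup is needed, which is exactly the point made above.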
If I'm doing primary-to-secondary replication, or master-to-slave, all the updates to my database have to go to the primary node, and then the read activity, which isn't changing the database in any way, can go against any node. So I can read from any node, but all my updates have to go to the primary node. What are the advantages of primary-to-secondary replication? Well, it gives me good read scalability: I can add more secondary nodes and spread out the read work against my database, and with all the updates going to the primary node only, I can guarantee update isolation. So I don't have the risk of those read anomalies that we learned about in our lesson on concurrency. The disadvantage is that if I've only got one primary node, my database capacity is going to be constrained by that node's capacity, so I'm limiting myself. An alternative is to do peer-to-peer replication, where both my updates and my reads can go to any node. This gives me the advantage of good scalability: I can just add more nodes to my cluster, spread the work out, and get things done faster. And peer-to-peer replication provides very robust high availability in case of a node failure. So if one of my nodes goes down, no problem: I've got that data stored elsewhere on another node. The disadvantage of peer-to-peer replication is that it's pretty complex, and it runs into different problems caused by trying to guarantee update isolation. Imagine I've got a server node here with a user updating it, and another server node over here with another user updating the exact same row: two users updating the same data on two different nodes in the cluster. Those nodes have got to communicate with each other and say, hey, I'm updating this row, and hey, I'm updating this row.
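The primary-to-secondary rule above (all writes to the primary, reads spread over any node) is what a read/write-splitting router implements. Here is a minimal sketch; the class name and the keyword-based detection of writes are assumptions made for illustration, not any particular driver's API.

```python
# Sketch of read/write splitting for primary-to-secondary replication:
# statements that change data go to the single primary, while reads are
# rotated round-robin across the secondaries for read scalability.
import itertools

class ReplicaRouter:
    def __init__(self, primary: str, secondaries: list):
        self.primary = primary
        self._reads = itertools.cycle(secondaries)   # round-robin reads

    def route(self, statement: str) -> str:
        """INSERT/UPDATE/DELETE -> primary; anything else -> a secondary."""
        verb = statement.strip().split()[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE"):
            return self.primary
        return next(self._reads)
```

Adding more secondaries to the list scales out the read traffic, but every write still funnels through the one primary, which is precisely the capacity constraint the lesson points out.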
So the management of concurrency, and the possible situations that can arise when I've got multiple concurrent updates going on, gets a lot more complicated when I'm doing peer-to-peer replication, because it's difficult to guarantee update isolation. Yet peer-to-peer replication solves a lot of other replication issues. So it becomes a decision point for database technology specialists: okay, which replication are we going to set up? Another thing to think about with replication is that I might have my database cluster spread into different data centers that are geographically separate from each other. If I do that, I get the advantage of geographically separate data centers: if one goes down, the other one is still available. But the work that must be done to keep those two geographically separated clusters in sync with each other is going to be delayed by network latency, because it takes time for data to move across networks. So there's a tradeoff here: as I decide to take advantage of geographic separation of the nodes in my cluster, I've got to understand that as I try to maintain data consistency across that separation, I'm going to be reducing my processing throughput. And so cluster designers face a decision: do I want to slow things down and ensure data consistency, or do I want to speed things up at the sacrifice of data consistency? It's just another part of this relational problem. Here's a picture that illustrates that. Take a look at this architecture, where I've got a database that's spread out onto four nodes in a cluster. Node 1 and Node 2 are in the same data center on the west coast of the United States, working with each other over a local network, and Node 3 and Node 4 are on the east coast at a data center, working with each other across a local network.
But the nodes in the cluster also need to be able to talk to each other over a wide area network, so that this partition is kept in sync with this partition across a 3,000-mile network, okay? So you can imagine it takes time: if I update shard one on Node 1, that update needs to be replicated over to Node 3 across a wide area network, and there's going to be network latency that causes a slight delay in that update. So if I choose to sacrifice data consistency for the sake of faster throughput, it's possible that there are rows in these two partitions that should be in sync but are not. One user of the application could issue a query against Node 1, where a row has already been updated, and another user could issue a similar request that the watcher sends to Node 3, because it's a closer data center, and that row has not yet been updated because of the network latency. What that shows is the tradeoff between data consistency and fast performance and throughput, and it's a tradeoff that database designers setting up nodes in a cluster need to be aware of. So here is my solution to the relational problem: if I choose to continue to use my relational database systems to handle big data, I can do that through clustering, sharding, and replication. Or I can opt to abandon my relational database system and adopt a whole new kind of database that we call NoSQL, which means, by the way, "not only SQL": it doesn't mean no SQL, it means not only SQL. So this is a decision that I face as I try to solve the relational problem. Now, in our next video, we're going to be taking a much deeper look into the NoSQL systems that are available, so just keep that in mind. So what are the pros and cons of deciding to continue to use relational database software with my big data? It's a tough decision that database architects have to make. I can keep using my relational systems.
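The stale-read scenario just described can be sketched as a toy model of asynchronous geo-replication. Everything here is illustrative: the two dictionaries stand in for the two data centers, and calling `replicate()` stands in for the WAN latency finally elapsing.

```python
# Sketch of asynchronous geo-replication: a write lands on the
# west-coast copy, updates queue for replication over the WAN, and a
# read routed to the nearer east-coast replica can see stale data
# until replication catches up.

class GeoReplica:
    def __init__(self):
        self.west = {}              # west-coast copy (receives the write)
        self.east = {}              # east-coast replica
        self._pending = []          # updates still in flight over the WAN

    def write_west(self, key, value):
        self.west[key] = value
        self._pending.append((key, value))   # queued for replication

    def read_east(self, key):
        return self.east.get(key)   # may be stale: eventual consistency

    def replicate(self):
        """Network latency elapses; pending updates reach the east coast."""
        for key, value in self._pending:
            self.east[key] = value
        self._pending.clear()
```

Between the write and the `replicate()` call, the two users in the example above really would get different answers for the same row, which is the consistency-versus-throughput tradeoff in miniature.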
I can leverage the investment I've already made in all that software. I can leverage the knowledge of my existing staff. I don't have to rewrite all of my software. I can expand my clusters horizontally. I can use the cloud and build my clustered network of nodes in the cloud, so I don't have to host all those servers in my own data center. I can take advantage of replication and sharding. I can take advantage of parallelization, and if I need to, I can relax ACID compliance for faster throughput. So those are some of the things I have to think about as I'm trying to make this decision. On the other hand, I could adopt a new NoSQL solution that may be designed to handle unstructured data very well. But if I do that, I might have to retrain or even replace my staff, and I might have to rewrite all of my application code that was written to use a relational database so that it uses NoSQL instead; I'm going to have to recode and test all of my application software. But the NoSQL systems that we're going to look at in the next lesson all use replication, sharding, and parallelization; that's why they're so fast. They typically relax ACID compliance, so they don't worry about transactions in that way, and they typically opt for speed over data consistency. So it becomes a very interesting choice for database architects and database designers to make as they solve this relational problem: do I keep my relational systems, or do I opt for NoSQL? So let's take a deeper look at NoSQL in our next lesson. Thank you. We'll talk to you next time. Bye, bye.