Welcome back. So now,
we're going to talk a little bit about replication.
In a distributed database,
after you do fragmentation and end up with multiple fragments,
you may need to replicate some of those fragments
among the different sites of the distributed database.
So, the question is: why replication?
There are many reasons why you may
need to replicate some of the data fragments you have.
First of all is increased availability.
If you replicate the data fragments,
you increase the availability of your system.
Let's take an example.
If one site goes down for some reason,
say the machine fails
or the power goes off at that site,
then the data fragment that exists on that site
cannot be accessed by the application
if it's not replicated.
But if you have multiple replicas of this data fragment,
you may be able to access it at a different site where it's replicated.
So, that's one example of how it increases availability.
Also, increased availability is
not only about failures that might happen at the sites.
The sites might be okay, but
suppose you don't have replicas of a data fragment,
and there are so many accesses from the application to that fragment.
Every site can handle
only a limited number of requests for that fragment,
and if you reach that limit, the system will not be available for more requests.
In that case, if the data fragment is replicated,
you can route the requests to a replica that exists on a different site,
and hence the system will be more available that way.
That's another example.
As a by-product of availability,
you can also get faster query evaluation.
We said a site cannot handle more than x requests,
and if there are more than x requests,
you will have to put them in a queue,
for example, and wait until the existing requests are serviced.
This will make query processing slower.
But if you have replicated versions of the same data,
you can handle the requests faster,
because you can run one request on one replica,
the second request on a different replica,
and they can all run at the same time because the data is replicated.
That's also one aspect.
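To make that idea concrete, here is a minimal sketch in Python of how a request router might send a read to a replica that is alive and not overloaded. The site names, fragment names, and the capacity number are made up for illustration; a real distributed database does this with far more machinery.

```python
# A minimal sketch of replica-aware read routing.
# Site names, fragment names, and the capacity number are hypothetical.

replicas = {
    "F1": ["site1", "site3"],   # fragment F1 is replicated at two sites
    "F2": ["site2"],            # fragment F2 has a single copy
}

site_alive = {"site1": True, "site2": True, "site3": True}
site_load = {"site1": 0, "site2": 0, "site3": 0}   # requests currently being served
SITE_CAPACITY = 100                                 # max concurrent requests per site

def route_read(fragment):
    """Pick an available, least-loaded replica for a read; None if all are down or full."""
    candidates = [s for s in replicas.get(fragment, [])
                  if site_alive[s] and site_load[s] < SITE_CAPACITY]
    if not candidates:
        return None          # no replica can serve the request right now
    best = min(candidates, key=lambda s: site_load[s])
    site_load[best] += 1
    return best

# If site1 goes down, reads on F1 can still be served from site3.
site_alive["site1"] = False
print(route_read("F1"))   # -> "site3"
print(route_read("F2"))   # -> "site2"
```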
And as an example of that, again,
I always like to use the Facebook example.
Facebook is
an application that runs on top of a database,
and for availability, yes,
every user expects the system to be available all the time.
You will not be happy
if you log into Facebook
and the system doesn't bring up your profile
and says it's not available.
And this rarely happens on Facebook;
every once in a while you might try to access
Facebook and it doesn't work for a minute or something like that,
but that's very rare these days,
because they have good replication,
and because of that they have good availability for the system.
Also, you will not be happy
if you log into Facebook
and you have to wait a long time to access your friends' news feed.
So, in that case,
Facebook might be doing some sort of replication for faster query evaluation.
It might replicate your friends' information on one site
and on another site also, so
that all the people who are friends of this person can access the data faster.
So, these are all benefits of replication.
However, there is a huge challenge with replicating data, which is
when you have updates to the data.
If you have multiple replicas of the same data fragment
and you want to update the data fragment,
this means you have to update all the replicas as well,
because these replicas are all supposed to be exact copies of each other.
You need to update them; otherwise,
there is inconsistency between the replicas.
So, updating, say, three replicas will take more time,
so updates will be slower.
Also, if you have a lot of updates and a lot of transactions coming in,
that also might be challenging:
how to do concurrency control with multiple updates and
multiple transactions running on replicated data fragments.
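To see why updates get slower, here is a minimal sketch, again with made-up site and fragment names, of a naive write-all update, where one logical update turns into one physical update per replica. Real systems use more refined protocols, such as primary-copy or quorum-based replication, but the extra work per copy is the basic issue.

```python
# A minimal sketch of why updates slow down with replication:
# the write is not "done" until every replica has applied it.

import time

replicas = {"F1": ["site1", "site2", "site3"]}   # hypothetical placement

def apply_at_site(site, fragment, new_value):
    time.sleep(0.05)          # stand-in for network + disk latency at that site
    print(f"{site}: {fragment} updated to {new_value}")

def update_fragment(fragment, new_value):
    # One logical update becomes one physical update per replica,
    # so latency grows with the number of copies we must keep consistent.
    for site in replicas[fragment]:
        apply_at_site(site, fragment, new_value)

update_fragment("F1", "new row values")
```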
So, another reason why replication
improves performance is that
when you replicate the data,
as in the Facebook example,
you can put the data close to where it's accessed.
So, you replicate fragments, again,
say I replicate friends' information on Facebook,
friends' news feeds, and posts,
and everything on Facebook,
and put it close to where it's accessed,
close to where people actually log in and see this information.
So, Facebook has multiple servers, one in Phoenix,
one in New York,
another one in Beijing in China, so
it replicates the information
and puts it where it's accessed the most,
and this way, being close to where it's accessed,
retrieval is faster over the network of the distributed database.
So, in that case you need fragmentation,
to fragment the data, and sometimes you will also need replication.
Now, there are a few theoretical alternatives for replication.
You can have a totally partitioned database,
no replication whatsoever, so
it's a non-replicated database.
We call it partitioned because each fragment resides at only one site.
Sometimes we also call it a sharded database,
or a sharded deployment of the database.
So, there's no replication:
each fragment exists on one site, and only one site.
So, here we have F1, F2,
F3, F4,
and sites one, two, three, and four.
Fully replicated means that each fragment is replicated at every single site.
So, site one
has F1, F2, F3, and F4.
Site two is the same, and so are site three and site four.
So, this is a fully replicated database.
And then you have a partially replicated database.
In that case, each fragment can be replicated at some of the sites.
Most distributed databases these days
follow a partial replication model.
And not only distributed databases;
we're also talking about big data systems like Hadoop,
and Spark, and all these kinds of newer systems.
They have some replication factor,
and that's partial replication.
You might decide, for example, I'm going to replicate that fragment
among three different machines or three different sites across the network,
which is considered enough for higher availability,
or higher reliability, for example.
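As a rough illustration of partial replication with a replication factor, here is a small Python sketch that spreads each of four fragments over three of four sites. The sites and the round-robin placement policy are hypothetical; systems like HDFS default to a factor of 3 but place replicas by their own rules.

```python
# A minimal sketch of partial replication with a replication factor.
# The sites, fragments, and rotating placement policy are hypothetical.

REPLICATION_FACTOR = 3
sites = ["site1", "site2", "site3", "site4"]
fragments = ["F1", "F2", "F3", "F4"]

placement = {}
for i, frag in enumerate(fragments):
    # Spread each fragment over 3 of the 4 sites, rotating the starting site
    # so that no single site ends up holding every fragment.
    placement[frag] = [sites[(i + k) % len(sites)] for k in range(REPLICATION_FACTOR)]

for frag, where in placement.items():
    print(frag, "->", where)
# e.g. F1 -> ['site1', 'site2', 'site3'], F2 -> ['site2', 'site3', 'site4'], ...
```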
So, a rule of thumb,
and this is a very simple rule,
not a hard one, and there are more details to it:
if you have a lot of queries as compared to updates,
replication is definitely good for higher availability,
better performance, and so on.
Otherwise, replication may cause some problems.
But you can still do it
if you can overcome these problems,
and the problems, as we said,
are mainly with updates on replicas,
that is, updating the replicas, and a related problem is
transaction processing and concurrency control
on a replicated database.
That's another problem.
There are already some mechanisms in the literature,
some approaches for how to update replicas
and how to achieve concurrency control on replicated data fragments.
But still,
to this day, it's a design problem,
it's an issue with replicated databases,
and there are many ways to solve it,
but no single solution fits all scenarios.
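Coming back to that rule of thumb, here is a tiny sketch that compares the read rate to the update rate and suggests whether replication is likely to pay off. The 10:1 threshold is an arbitrary number picked for illustration, not a standard.

```python
# A minimal sketch of the rule of thumb: if reads dominate updates,
# extra replicas tend to pay off; if updates dominate, keep fewer copies.
# The 10:1 threshold is purely illustrative.

def suggested_replication(reads_per_sec, updates_per_sec, threshold=10.0):
    if updates_per_sec == 0:
        return "replicate widely (read-only workload)"
    ratio = reads_per_sec / updates_per_sec
    if ratio >= threshold:
        return f"replicate (read/update ratio {ratio:.1f} favors extra copies)"
    return f"limit replication (ratio {ratio:.1f}; update cost would dominate)"

print(suggested_replication(5000, 50))   # read-heavy -> replicate
print(suggested_replication(200, 400))   # update-heavy -> limit replication
```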
So, after you decide on your fragments,
you need to decide how to allocate the fragments to the different sites,
as we mentioned, and whether to replicate or not, as we also mentioned.
The problem here is that you have fragments,
you've done horizontal or vertical fragmentation,
or derived horizontal fragmentation, whatever kind of fragmentation you came up with,
and you have network sites.
These sites are physical sites.
And you have queries, and the queries again come from the application.
In a Facebook application, for example,
one query is "retrieve all the news from my friends".
That's a query, and it comes from the application.
You have multiple such queries that can run on a Facebook application, for example.
And the idea here is that, given this application, the fragments,
and the sites, you want to find
the optimal distribution of the fragments to the sites, the physical distribution.
How can you allocate the fragments to the sites?
This is actually a very hard problem,
because you need the optimal solution:
you need to minimize the cost
and increase the performance for all applications and all the queries.
And performance here means response time and throughput.
So, a lot of questions arise
if you really want
to get an optimal solution.
So, first of all, where do queries originate from?
Do you have a lot of people logging in to Facebook from China,
or from the US, or from another country?
Or even from another city,
if you look at it from a city-level perspective?
That's one thing.
What is the communication cost
between the sites that you have?
Facebook has a distributed database system spread across the globe, so
what is the cost of transferring the data,
the fragments, from the server or machine that exists
in China to the machine that exists in the US?
Again, I'm simplifying the problem;
there is more to it than that,
but this is what we mean by communication cost.
And this will affect the performance of the database.
What is the storage capacity and cost at each site?
For example, every site has its own machine or machines with their own specs,
and the specs can be things like:
how much storage does this machine have?
How much CPU? How much memory does this machine,
or this cluster of machines at this site, have?
All of these issues need to be taken into account,
because they affect the performance as well.
The processing power at each site is also different,
so how can you account for that?
The query processing strategy at every site can also be a little different,
and the allocation algorithm needs to take that into account.
Do we have replication?
As we mentioned already,
this is also another aspect that the allocation approach needs to take into account.
What is the update cost of each copy if you have replication?
That's also one thing. How can you achieve concurrency control?
Also that's another issue.
So, all of these issues need to be taken into
account by the allocation approach
before deciding which fragment to assign to which site.
And again, as you can see,
it's an exhaustive list of things,
and practically speaking, no allocation algorithm can actually handle all of these,
or take all of that into account.
So, most of the time there are heuristics for how to allocate
the fragments to the sites,
and this is what real, practical
distributed database systems actually use.
So, eventually you want to minimize the query response time for each query,
maximize the throughput when you have both updates and queries on the data,
and minimize the cost,
where the cost here is in terms of what kind of resources you use to answer these queries
or handle the requests coming to the system.
And again, there are the constraints we talked about,
like the storage, the response time, the bandwidth,
the processing power that you have in the system;
all of these kinds of things need to be taken into account.
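As one concrete example of the kind of heuristic involved, here is a minimal greedy sketch in Python that places each fragment (a single copy, ignoring replication) at the site that minimizes the total communication cost, given where the queries that touch it originate. The sites, cost numbers, and access counts are all made up; real allocation algorithms weigh many more of the factors listed above.

```python
# A minimal greedy allocation sketch: put each fragment at the site that
# minimizes total (accesses x shipping cost). All numbers are hypothetical.

# comm_cost[a][b]: relative cost of shipping data from site a to site b
comm_cost = {
    "phoenix": {"phoenix": 0, "newyork": 2, "beijing": 8},
    "newyork": {"phoenix": 2, "newyork": 0, "beijing": 7},
    "beijing": {"phoenix": 8, "newyork": 7, "beijing": 0},
}

# accesses[fragment][site]: how many queries touching the fragment originate at that site
accesses = {
    "F1": {"phoenix": 900, "newyork": 100, "beijing": 50},
    "F2": {"phoenix": 20,  "newyork": 30,  "beijing": 800},
}

def allocate(fragment):
    """Choose the site that minimizes total shipping cost for one fragment."""
    def total_cost(candidate):
        return sum(count * comm_cost[candidate][origin]
                   for origin, count in accesses[fragment].items())
    return min(comm_cost, key=total_cost)

for frag in accesses:
    print(frag, "->", allocate(frag))
# F1 lands near Phoenix and F2 near Beijing, because that is where
# most of the queries that touch them come from.
```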
So, this is just to give you a flavor of how allocating fragments to sites,
for a specific application,
can be a very challenging, very hard problem.