Let's look at some transformative changes.
First, fast random access.
Many programming languages are object-oriented,
which means much of the software that you write deals with objects.
The good thing about objects is that they're hierarchical.
Let's say, for example,
that in your program you have a hierarchical data structure like a player.
You have players who are either footballers or cricketers,
and among cricketers, everyone bats but not everyone bowls.
So all cricketers have batting averages,
but only some cricketers have bowling averages.
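Here is a minimal sketch in Java of the kind of hierarchy being described (the class and field names are purely illustrative):

```java
// Illustrative object hierarchy: easy to express in an object-oriented
// language, awkward to flatten into a single relational table.
abstract class Player {
    String name;
    String club;
}

class Footballer extends Player {
    int goalsScored;           // footballers have no batting or bowling average
}

class Cricketer extends Player {
    double battingAverage;     // every cricketer bats
    Double bowlingAverage;     // null for cricketers who never bowl
}
```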
So you have this hierarchy,
and you want to represent it in
a relational database because you want to persist the data.
If you use a relational database to persist an object hierarchy,
you run into the object-relational impedance mismatch.
For example, we said that
all cricketers have batting averages,
but not all of them have bowling averages.
However, if you go look at the table itself,
there will be a name, a club,
a batting average, and a bowling average.
What is the batting average of a football player?
It makes absolutely no sense,
but we have to have the column,
so we basically go ahead and put a null in there.
Once you do that,
you basically have data integrity problems from this point onwards.
So how do you prevent this kind of issue with an object-to-relational mapping?
Well, one way is that you can store objects directly.
That's what Cloud Datastore on GCP lets you do.
Datastore scales up to terabytes of data,
whereas a relational database typically tops out at a few gigabytes.
What you're storing in Datastore is,
conceptually, like a HashMap:
a key, or an id,
mapped to an object.
It stores the entire object.
So when you are writing to Datastore,
you're writing an entire object.
But when you are reading from it, that is, searching,
you can search by the key, but you can also search by a property.
So you can look for all cricket players
whose batting average is greater than 30 runs a game.
Right. You can basically do this using
one of the indexed fields, which we will look at shortly.
If you want to update, you can update just the batting average of a player,
and you can do it in a transactional way,
because Datastore supports transactions.
It allows you to read and write structured data.
It is a pretty good replacement for
use cases in your application where you may be
using a relational database to persist your data.
However, this replacement is something that you would have to do explicitly,
unlike the things that we talked about in the previous chapter.
For example, in the previous chapter,
we said that if you have a Spark program that you're running on a Hadoop cluster on-premises,
and you want to run it on GCP,
you just run it on Dataproc.
Pretty much all of your code just migrates unchanged.
If you have a MySQL code base,
whatever you're doing with your MySQL on-premises you can do
with MySQL on Google Cloud using Cloud SQL.
Those are easy migrations:
take what you have,
and those use cases just move up to the cloud.
But when we talk about something like Datastore,
it's not that easy a migration.
You have to change your code,
because the way you interact with Datastore is
different from the way you'd interact with a relational database.
So how do you interact with Datastore?
Well, the way you work with Datastore
is that it's like a persistent HashMap.
So for example, let's say we want to persist objects that are author objects.
You'd say I have an Author class,
and it's an entity; @Entity is
an annotation that you add. I'm showing you Java
here, but it works with a variety of object-oriented languages.
You say that the author is distinguished by their e-mail address.
The email address is the id field,
so you annotate it with @Id.
We want to search for authors by name.
So we'd like the name property to be indexed.
Just to show you that you can have has-a relationships,
an author has a bunch of different interests.
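As a sketch, the Author entity being described might look something like this with the Objectify library (the annotations @Entity, @Id, and @Index are Objectify's; the exact field names are illustrative):

```java
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;

import java.util.ArrayList;
import java.util.List;

// The whole object is stored; the email is the key, and only indexed
// properties (here, name) can be used in queries.
@Entity
public class Author {
    @Id String email;                            // authors are distinguished by email
    @Index String name;                          // indexed so we can search by name
    List<String> interests = new ArrayList<>();  // a has-a relationship

    private Author() {}                          // Objectify needs a no-arg constructor

    public Author(String email, String name) {
        this.email = email;
        this.name = name;
    }
}
```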
The same idea applies to guestbook entries.
You store guestbook entries,
and each entry has an id that makes it unique.
It also has a parent key pointing at the author;
that is the person who wrote the guestbook entry,
and that's the relationship. You have messages,
which are not indexed, because apparently we're never
going to search for guestbook entries based on
the text of the message, and we have dates.
Right. That's something that we might want to search based on.
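A guestbook entry entity along those lines might look roughly like this (again a sketch; the @Parent annotation is Objectify's way of expressing the parent-key relationship, and the field names are made up):

```java
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.annotation.Parent;

import java.util.Date;

// A guestbook entry: unique id, a parent key pointing at the author who
// wrote it, an unindexed message, and an indexed date.
@Entity
public class GuestbookEntry {
    @Id Long id;                 // auto-allocated id that makes each entry unique
    @Parent Key<Author> author;  // the author who wrote this entry
    String message;              // not indexed: we never search by message text
    @Index Date date;            // indexed: we may search entries by date
}
```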
So once you have an entity,
an author with an email, which is the id,
and a name, which is indexed,
and you want to create an author,
you basically call the constructor,
just as you would for any Plain Old Java Object:
new Author with the email xjin@bu.edu and the name Ha Jin.
You'll have your author object.
But at this point, the author object is only in memory.
You want to save it.
Right. You basically call save,
passing in the entity.
Ofy here comes from the Objectify library.
It's one of several Java libraries that help you deal with Datastore.
So this code is showing you Objectify:
we save the entity,
and at this point,
the xjin object has been persisted.
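Put together, the create-and-save step looks something like this (a sketch; ofy() is the statically imported entry point from ObjectifyService):

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

public class CreateAuthor {
    public static void createAndSave() {
        // Construct the author like any Plain Old Java Object;
        // at this point it exists only in memory.
        Author xjin = new Author("xjin@bu.edu", "Ha Jin");

        // Persist it; only after save() is the object in Datastore.
        ofy().save().entity(xjin).now();
    }
}
```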
If you want to read it, right?
If you want to search for it,
what you can do is load all authors and filter them by name Ha Jin.
Because name is an indexed field, we can do this,
we can filter by name Ha Jin and we will get back an iterable of authors.
Why iterable and not a list of authors?
Well, because Datastore scales up to terabytes.
When you search based on one of these properties,
what comes back could be gigabytes of data, much more than can fit into memory.
So, we give you back an iterable.
Well, if you know you're going to get back only one item,
as in the second example here, you load authors
and find the one with id xjin@bu.edu.
At that point, you're going to get one author back,
and so you basically get back the author object,
which we call jh within the code.
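The two read paths described above might look roughly like this (a sketch; filter() only works on indexed properties, and the query result can be iterated without loading everything into memory):

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

public class ReadAuthors {
    public static void readExamples() {
        // Query by an indexed property: comes back as an Iterable, because
        // the result set could be far larger than fits in memory.
        Iterable<Author> authors =
                ofy().load().type(Author.class).filter("name", "Ha Jin");

        // Look up by id (the email): at most one entity comes back.
        Author jh = ofy().load().type(Author.class).id("xjin@bu.edu").now();
    }
}
```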
Now, we can update the name of jh.
You set jh.name equal to Jin Xuefei,
then save that entity.
At this point,
the persisted object basically has a new name.
Then if you want to delete the entity,
we just say delete entity jh.
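And the update and delete steps, as a sketch:

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

public class UpdateAndDelete {
    public static void updateAndDelete(Author jh) {
        // Change just one field and re-save the entity.
        jh.name = "Jin Xuefei";
        ofy().save().entity(jh).now();

        // Delete the entity when it is no longer needed.
        ofy().delete().entity(jh).now();
    }
}
```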
So create, read, update, delete.
You can pretty much do everything that you do in a relational database,
in a transactional way using Datastore.
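If a read-modify-write needs to be atomic, Objectify lets you wrap it in a Datastore transaction. A sketch (the exact transact signature varies a bit across Objectify versions):

```java
import com.googlecode.objectify.VoidWork;
import static com.googlecode.objectify.ObjectifyService.ofy;

public class TransactionalUpdate {
    public static void renameAuthor(final String email, final String newName) {
        // The load and the save below either both take effect or neither does.
        ofy().transact(new VoidWork() {
            @Override
            public void vrun() {
                Author author = ofy().load().type(Author.class).id(email).now();
                author.name = newName;
                ofy().save().entity(author).now();
            }
        });
    }
}
```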
Another access pattern. These are, again, alternatives to using a relational database.
You could use Datastore
if you need transactional support for
hierarchical data, something that relational databases don't handle very well.
Another reason that a relational database may not work very well, and we discussed
this in the module review section of the previous chapter,
is if you have high-throughput needs.
If you have sensors
that are distributed all across the world,
and you're basically getting back millions of messages a minute,
that's not something that Cloud SQL, or a relational database in general,
can handle very well.
That's essentially an append-only operation.
We're just getting your data and saving it.
Right? We don't need transactional support.
Because we are willing to give up transactional support,
the capacity of Bigtable is no longer
the terabytes that Datastore can support, but petabytes.
On the other hand, what we've given up
is the ability to update just a single field of the object.
We have to write
an entirely new row.
The idea is that if we get a new object,
we basically append it to the table,
and then we read from the latest data and go backwards,
so that the very first object we find for a particular key
is the latest version of that object.
So Bigtable is really good
for high-throughput scenarios where you don't want to
be in the business of managing infrastructure.
You want something that is as no-ops as possible.
With Bigtable, you basically deal with flattened data.
So it's not for hierarchical data.
It's flattened.
You search only based on the key.
Because you can search only based on the key,
the way you design the key becomes extremely important.
Number one, you want to think about the key
as being the query that you're going to make.
Because again, you can only search based on the key;
you cannot search based on any properties.
Since you can't search fast based on properties,
you're going to be searching based on keys,
and you want the key itself to be designed such that
you can find the stuff that you want quickly.
The key should also be designed such that there are no hotspots.
You don't want all of your objects,
all of your rows,
falling into the same bucket.
You want writes to be distributed.
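As one illustration of key design (the format here is just one possibility, not a prescription), you might lead the row key with the field you query by, then append a reversed timestamp so that a prefix scan returns the most recent row first and sequential writes don't all land on the same node:

```java
public class RowKeys {
    // Illustrative row key: query field first, then a reversed timestamp so
    // the latest version of an object sorts to the top of a prefix scan.
    public static String rowKey(String symbol, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return symbol + "#" + reversed;
    }
}
```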
Tables themselves should be tall and narrow.
Tall, because you keep appending to them.
Narrow, why?
The idea is that if you have Boolean flags, for example,
rather than have a column for each flag,
with the value being zero or one,
maybe you just have a single column that lists only the flags that are true for this object.
This kind of thing becomes extremely useful if you're
trying to store, for example, users' ratings.
A user may rate only five out of the thousands of items in your catalog,
and rather than have thousands of columns, one for every item,
you simply store
an item and its rating, as a new column,
for only the things that they have actually rated.
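To make the ratings example concrete, here is a sketch using the HBase client API that Bigtable exposes (the table, column family, and way of encoding the rating are all assumptions): one row per user, and one cell per item that user actually rated, with the item id as the column qualifier.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class Ratings {
    // Store a single rating: the row is the user, the column qualifier is
    // the item that was rated, so unrated items take up no space at all.
    public static void saveRating(Table ratingsTable, String userId,
                                  String itemId, double rating) throws IOException {
        Put put = new Put(Bytes.toBytes(userId));
        put.addColumn(Bytes.toBytes("ratings"),   // column family
                      Bytes.toBytes(itemId),      // qualifier = the rated item
                      Bytes.toBytes(rating));     // the rating value
        ratingsTable.put(put);
    }
}
```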
Even though we said that your data has to be flattened,
there is this concept of a column family.
So you can say, for example,
that MD here is market data,
giving you MD:SYMBOL and MD:LASTSALE.
This is a way to group together related columns.
The reason to use Bigtable is that it's no-ops:
it's automatically balanced, automatically replicated, and automatically compacted.
It's essentially no-ops.
You don't have to manage any of that infrastructure,
and you can deal with extremely high-throughput data.
This is how you work with Bigtable:
you work with it using the HBase API.
That's why what you're importing is org.apache.hadoop.hbase.
So you work with it the way you would normally work with HBase.
You basically get a connection,
you go to the connection and get your table,
you create a Put operation,
you add all of your columns, and then you put
that into the table, and you've basically added a new row to the table.
If you're familiar with HBase,
it works exactly the same way.
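Putting that together, writing one row of the market data example might look like this (a sketch; BigtableConfiguration.connect comes from the Cloud Bigtable HBase client, and the project, instance, table, values, and row key format are placeholders):

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteMarketData {
    private static final byte[] MD = Bytes.toBytes("MD");  // "market data" column family

    public static void main(String[] args) throws Exception {
        // Get a connection through the HBase API, then get the table.
        try (Connection connection =
                     BigtableConfiguration.connect("my-project", "my-instance");
             Table table = connection.getTable(TableName.valueOf("prices"))) {

            // One Put is one new row: add each column, then write the row.
            String rowKey = "GOOG#" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(MD, Bytes.toBytes("SYMBOL"), Bytes.toBytes("GOOG"));
            put.addColumn(MD, Bytes.toBytes("LASTSALE"), Bytes.toBytes("127.50"));
            table.put(put);
        }
    }
}
```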