So the Human Genome Project
culminated in the first complete sequence of a human genome,
an enormous book with three billion characters, ACGTs in succession.
Of course, the main idea was to identify
especially the functionally important elements in this genome,
the genes and their gene switches.
When you annotate a new genome,
it's a good idea to start by simplifying your life by masking the repetitive sequences.
And there are excellent tools to do that,
especially RepeatMasker and its associated database, Repbase.
Once we have done that, we can,
for instance, target the genes.
And what has probably yielded the most information when it
comes to gene annotation is the exploitation of
their key feature, namely that genes are transcribed
into RNA copies in at least one of the cell types of our bodies.
So we are able to sequence the transcriptome,
that is the entire collection of RNA molecules in different cells,
and transcriptome characterization and sequencing has rapidly evolved over time.
It went from Sanger sequencing of cDNA and EST libraries to,
more recently, massively parallel or next-generation sequencing of stranded RNA libraries,
small RNA libraries, or tags such as SAGE and
CAGE tags, which target the 3'
and the 5' ends of genes, respectively.
All these sequences, all these reads,
corresponding to RNA molecules can be mapped back to the human reference genome, thereby,
by definition, identifying non-coding and coding genes,
and in fact, defining exon and intron structure.
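To make the idea of defining exon and intron structure from mapped reads concrete, here is a minimal sketch in Python. It assumes that the aligned segments of all reads from one gene are already available as simple (start, end) coordinates, and it merges overlapping segments into candidate exons. The function name `merge_exon_blocks` and the coordinates are hypothetical, and a gap between two merged blocks is only a candidate intron: real annotation pipelines rely on the splice junctions within reads rather than on gaps in coverage.

```python
def merge_exon_blocks(blocks):
    """Merge overlapping aligned segments into candidate exons.

    blocks: (start, end) half-open intervals on one chromosome, taken from
    many reads of the same gene. Returns (exons, candidate_introns).
    """
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    exons = [tuple(block) for block in merged]
    # Gaps between consecutive candidate exons are candidate introns.
    introns = [(left[1], right[0]) for left, right in zip(exons, exons[1:])]
    return exons, introns


# Hypothetical aligned segments, for illustration only.
blocks = [(100, 150), (140, 200), (500, 560), (520, 600), (900, 950)]
exons, introns = merge_exon_blocks(blocks)
print(exons)    # [(100, 200), (500, 600), (900, 950)]
print(introns)  # [(200, 500), (600, 900)]
```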
One slight problem I would say with this approach
has been the observation of so-called pervasive transcription.
It is becoming increasingly apparent that not everything that is transcribed,
in fact, corresponds to a gene.
As an example, distant regulatory elements
seem to generate small transcripts when they are active.
But overall, the mapping back of sequences
corresponding to RNA is probably the most effective way to identify genes.
I should nevertheless briefly mention the fact that bioinformaticians
have developed so-called ab initio gene prediction software.
These programs exploit distinctive sequence features of genes and allow you to
directly identify quite effectively at least the protein coding genes in your genome.
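As an illustration of what a distinctive sequence feature can look like, here is a minimal sketch that scans the forward strand for open reading frames, that is, a start codon followed in frame by a stop codon. This is a toy under stated assumptions, not how actual ab initio predictors work: those combine many signals in probabilistic models. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) half-open ORF coordinates on the forward strand.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon; min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical sequence, for illustration only.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)]
```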
But let's now focus on the gene switches.
They're obviously extremely important functional elements and
identifying them is a bit more complicated than identifying the genes.
You may remember that the Human Genome Project not
only meant sequencing the human genome,
but also the mouse genome and after that the dog genome and after that the bovine genome.
At the present time, there are maybe 30 mammalian genomes that have been obtained.
Why was that an integral part of the Human Genome Project?
Well, the purpose from the outset was to use
this comparative genomics information to
identify so-called evolutionarily constrained elements.
If we compare the genomes of these different species,
there are parts of these genomes that are completely different between different species.
This typically corresponds to regions of the genome that
can evolve fairly rapidly without affecting
the fitness of the individuals within these species
because they don't carry functionally very important elements,
at least not conserved ones.
On the other hand, there are parts of the genome that seem not to be able to evolve.
They seem to be frozen, constrained.
This is because mutations in these parts of the genome are deleterious.
They affect the fitness and therefore,
they don't get fixed in the populations.
It's as if these pieces of the genome cannot evolve.
So if you align the genomes of multiple species
and look at the parts that are the same across all the species,
you know with virtual certainty, having sequenced
29 species, that this must be diagnostic of a functionally important element.
And these can, for instance, be used to identify gene switches.
More than 1 million evolutionarily constrained elements
have been identified in the mammalian genome using this approach.
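The principle can be sketched in a few lines of Python. Assuming a multiple alignment is already available as equal-length strings with "-" for gaps, the sketch below reports runs of columns that are identical in every species. This is a deliberate simplification: the function name `constrained_blocks`, the `min_len` cutoff, and the toy sequences are hypothetical, and real analyses compare the observed substitution rate with the rate expected under neutral evolution rather than requiring perfect identity.

```python
def constrained_blocks(aligned_seqs, min_len=5):
    """Return (start, end) half-open column ranges identical in all sequences."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    # Iterate one column past the end so that a trailing run gets closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and aligned_seqs[0][col] != "-"
            and len({s[col] for s in aligned_seqs}) == 1
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


# Hypothetical three-species alignment, for illustration only.
alignment = [
    "ACGTACGTAA",  # e.g. human
    "ACGTACGTCA",  # e.g. mouse
    "TCGTACG-AA",  # e.g. dog
]
print(constrained_blocks(alignment))  # [(1, 7)]
```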
So this is a very powerful and efficient approach,
but, of course, it can only identify conserved switches.
We might be interested in identifying
the switches that are unique to primates, if not to humans.
Of course, they're not going to be shared with the bovine.
So do we have other methods that we could apply to more
directly identify gene switches? Well, we do.
And I'll mention four different biochemical approaches.
The first one is so-called ChIP-seq, for
chromatin immunoprecipitation combined with
next-generation sequencing, targeting transcription factors.
So imagine that you know a transcription factor very well,
you even have an antibody that is specific for that transcription factor.
You may want to identify
all the gene switches to which this transcription factor binds in a given cell type.
How do you do that?
You take the cell type of interest, a sample of it,
you freeze the chromatin by cross-linking it with formaldehyde,
you break down this frozen chromatin into pieces,
and you incubate the resulting solution with
an antibody that will specifically recognize your transcription factor.
This allows you to immunoprecipitate, as it is called, these parts of the chromatin.
With the transcription factor, because it's frozen,
comes the DNA that defines the distant regulatory elements
of interest, and multiple ones, because
the transcription factor will bind across the genome.
What you do then is you reverse the cross-linking,
and you release the DNA that was co-immunoprecipitated with the transcription factor.
You put that into a next-generation sequencer and you will obtain reads,
sequence reads, corresponding to the distant regulatory elements.
If you map these sequences back to the reference human genome,
by definition, you identify the distant regulatory elements that you were after.
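To show what identifying elements from mapped reads amounts to computationally, here is a minimal sketch that piles up read intervals and reports the regions where the read depth reaches a threshold. It assumes the reads of one chromosome are already mapped and given as (start, end) coordinates. The function name `call_peaks` and the `min_depth` cutoff are hypothetical, and dedicated peak-calling tools do considerably more, notably comparing against a control sample with a statistical model.

```python
def call_peaks(reads, chrom_len, min_depth=3):
    """Return (start, end) half-open regions covered by at least min_depth reads.

    reads: (start, end) half-open intervals on one chromosome of length chrom_len.
    """
    # Difference array: +1 where a read starts, -1 where it ends.
    diff = [0] * (chrom_len + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    peaks, depth, peak_start = [], 0, None
    for pos in range(chrom_len + 1):
        depth += diff[pos]
        if depth >= min_depth and peak_start is None:
            peak_start = pos
        elif depth < min_depth and peak_start is not None:
            peaks.append((peak_start, pos))
            peak_start = None
    return peaks


# Hypothetical mapped reads, for illustration only.
reads = [(10, 20), (12, 22), (15, 25), (40, 50)]
print(call_peaks(reads, chrom_len=60))               # [(15, 20)]
print(call_peaks(reads, chrom_len=60, min_depth=1))  # [(10, 25), (40, 50)]
```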
So this is, of course,
a very nice and effective approach,
but it can only work for transcription factors that are
well-known and for which very good antibodies are available.
And this is certainly not the case for all transcription factors.
There are alternative methods that I will refer to as
generic methods because they don't require that information.
One is based also on ChIP-seq,
but the antibodies that you use do not target transcription factors,
but rather the post-translational modifications
of the amino-terminal tails of histones, which will
allow us, by means of the histone code, to
recognize the enhancers, silencers, and promoters.
So we will take the same tissue as we did before.
But we will now perform ChIP-seq experiments
sequentially using antibodies that recognize specific histone modifications.
We then put all the information back together using, for instance,
a Hidden Markov Model and can identify active enhancers,
poised enhancers, silenced enhancers, et cetera.
We can identify these distant regulatory elements
without knowing the transcription factor.
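To give a feeling for how a Hidden Markov Model puts such information together, here is a minimal sketch of Viterbi decoding over genomic bins. Everything specific in it is an assumption made for illustration: there are only two states, a single mark (H3K27ac, observed as present or absent per bin), and invented probabilities. Real chromatin segmentation uses many marks and states, with parameters learned from the data.

```python
import math

# Hypothetical two-state model; all probabilities are invented for illustration.
STATES = ("background", "active_enhancer")
START = {"background": 0.9, "active_enhancer": 0.1}
TRANS = {
    "background": {"background": 0.9, "active_enhancer": 0.1},
    "active_enhancer": {"background": 0.2, "active_enhancer": 0.8},
}
EMIT = {
    "background": {"none": 0.9, "H3K27ac": 0.1},
    "active_enhancer": {"none": 0.2, "H3K27ac": 0.8},
}


def viterbi(observations):
    """Return the most likely state path for a non-empty list of observations."""
    # Log-probability of the best path ending in each state at the first bin.
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    back = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


bins = ["none", "none", "H3K27ac", "H3K27ac", "H3K27ac", "none", "none"]
print(viterbi(bins))
# ['background', 'background', 'active_enhancer', 'active_enhancer',
#  'active_enhancer', 'background', 'background']
```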
Another generic method is not based on immunoprecipitation,
but will rather exploit a feature that is typical of active switches.
That is, the corresponding chromatin is relatively open.
It's less condensed than the rest of the genome.
And as a consequence,
it becomes more easily accessible to nucleases to which you might expose the chromatin or
to in vitro generated transposons that
will preferentially integrate where the chromatin is open.
DNase-Seq is the name of the first method and
ATAC-Seq based on transposons is the name of the second method.
Combined with next generation sequencing,
they allow you to identify
gene switches without knowing the transcription factors that bind to them either.
Well, this is an advantage and sometimes a disadvantage.
Once you have identified these switches in a generic way,
you may, a posteriori, like to answer the question: which
transcription factor might bind to the elements that I have identified?
In fact, if we look in detail at the result of DNase-Seq experiments,
we will see that the experiments generate so-called footprinting patterns.
So if we look at the piles of reads that are defining these gene switches,
we will see that not all the bases are equally covered.
This is because the protein factor binds specifically to parts of the DNA which are,
therefore, protected from DNase.
And if we look at these footprints,
we can actually identify transcription factor binding
sites that will allow us to go back to the transcription factor a posteriori.
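The footprint idea can be sketched as follows. Assuming per-base cleavage counts across one accessible region are available, the sketch looks for stretches with almost no cuts that are flanked by cleavage on both sides. The function name `find_footprints`, the thresholds, and the counts are hypothetical, and real footprinting methods also correct for the sequence preference of the nuclease and score candidate sites against known binding motifs.

```python
def find_footprints(cuts, max_cuts=1, min_len=6):
    """Return (start, end) half-open protected stretches inside one open region.

    cuts: per-base DNase cleavage counts across the region.
    A footprint is a run of at least min_len bases with at most max_cuts cuts
    each, with more heavily cut bases on both sides.
    """
    footprints, start = [], None
    for pos, count in enumerate(cuts):
        if count <= max_cuts and start is None:
            start = pos
        elif count > max_cuts and start is not None:
            # Runs touching the left edge are not flanked on both sides;
            # runs touching the right edge are never closed, so never reported.
            if start > 0 and pos - start >= min_len:
                footprints.append((start, pos))
            start = None
    return footprints


# Hypothetical cleavage counts, for illustration only.
cuts = [0, 0, 9, 8, 7, 0, 1, 0, 0, 0, 1, 8, 9, 7, 0, 0]
print(find_footprints(cuts))  # [(5, 11)]
```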
The last method that I would like to refer to is
chromatin conformation capture, which can be done in a targeted way,
but is increasingly done at the genome-wide level.
It is then called Hi-C.
The idea there is that one will try to identify pieces
of DNA that although far away on the chromosome,
are brought close to each other in the nucleus by virtue of
this looping structure that is formed when a gene switch is active in a given cell type.
Hi-C allows you to identify two pieces of
DNA that are brought close to each other by this process in the nucleus.
In the form of so-called promoter capture Hi-C,
it is a very effective new way to identify
distant regulatory elements controlling specific proximal promoters.
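As a rough picture of how such data are summarized, here is a minimal sketch that bins the two ends of each read pair and counts how often two distant bins are found together. It assumes read pairs from a single chromosome, given as two mapped positions. The function name `distal_contacts` and its parameters are hypothetical, and real analyses normalize the counts, in particular for the fact that nearby regions contact each other more often simply because they are close on the chromosome.

```python
from collections import Counter


def distal_contacts(read_pairs, bin_size=10_000, min_distance=50_000, min_count=2):
    """Count contacts between distant genomic bins on one chromosome.

    read_pairs: (pos1, pos2) mapped positions of the two ends of each pair.
    Returns {(bin_a, bin_b): count} for bin pairs seen at least min_count times.
    """
    counts = Counter()
    for pos1, pos2 in read_pairs:
        if abs(pos1 - pos2) < min_distance:
            continue  # Skip short-range pairs, which dominate the raw data.
        bins = tuple(sorted((pos1 // bin_size, pos2 // bin_size)))
        counts[bins] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical read pairs, for illustration only.
pairs = [
    (5_000, 205_000),
    (7_500, 201_000),
    (203_000, 9_000),
    (5_000, 12_000),     # too close, ignored
    (300_000, 900_000),  # seen only once, filtered out
]
print(distal_contacts(pairs))  # {(0, 20): 3}
```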