Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".
I am Ge Gao from the Center for Bioinformatics, Peking University.
Let's continue this week's topic: Sequence Database Search.
Last time we have illustrated with an example on how to use BLAST
to search databases for similar sequences of the query sequence
within a selected species or across species.
This can facilitate the study of the function and evolution of the query sequence.
In this unit, we will introduce the ideas behind the BLAST algorithm.
As mentioned before, BLAST was first published in 1990.
According to Google Scholar, its two main papers have been cited nearly 50,000 times each.
It is one of the most widely used and appreciated algorithms in bioinformatics.
So far there have been more than 30 different toolkits developed for BLAST.
Many papers have been written on how to use BLAST.
There are even books devoted to this algorithm.
Due to time constraints, we cannot cover every single aspect of BLAST here today.
We will only introduce its basic ideas and algorithms.
Students who are interested can get further information from the Readings section.
As mentioned before, in dynamic programming, the key to reduce the computational time is
to reduce the number of cells to fill in as much as possible.
Specifically, we should try to compute only those cells in or surrounding the optimal alignment path.
BLAST is an extension of this idea.
Specifically, BLAST will first find highly similar small segments between the two sequences,
called the “seeds”.
These seeds will then be extended outward to construct the alignments.
Finally,BLAST will compute the statistical significance for this alignment
in order to reduce false positives.
Specifically, BLAST will first slice the query sequence into multiple small segments,
called “seed words”.
Then, BLAST will search a pre-built index table of all seed words in the sequences in the database
Then, BLAST will search a pre-built index table of all seed words in the sequences in the database
to find the query sequence’s seed words which signifies candidate matches between
the query sequence and sequences in the database.
Students with background in computer sciences may realize immediately that
it would take only linear time (or even near constant time) to search an index table.
This will improve the computational efficiency considerably.
By locating candidates for every seed,
we can get a hit map between the query sequence and candidate match sequences in the database.
As discussed in the previous unit, the optimal alignment path should be parallel to the main diagonal.
Therefore, we can discard those isolated hits.
We keep only the “hit clusters” each of which has two or more successive hits parallel to the main diagonal.
This will reduce the search space further.
Then for each hit cluster, we can extend outward from the two endpoints
until the total score has decreased below a given value X.
Then for each hit cluster, we can extend outward from the two endpoints
until the total score has decreased below a given value X.
Next we can identify the optimal alignment
for each extended hit cluster by dynamic programming,
thus reducing the computational time considerably.
BLAST also uses some other tricks to further enhance its speed and sensitivity.
First, BLAST will mask (i.e. mark as “to be ignored” ) low-complexity sequences
that have numerous copies in the genomes to avoid excessive false positive hits.
Whether a sequence is low-complexity is measured by its information content.
This (information content) formula is very important.
Without it there are too many seed words to choose and use.
More importantly, without it we cannot avoid selecting seed words from the low-complexity regions,
leading to lots of false positive hits.
For example, the “CACACA” repeat
is a typical microsatellite sequence
that has a K value of 0.36.
This sequence will thus be masked to prevent it from giving false positives in database searches.
Also, to improve the sensetivity ,BLAST will consider not only the seed words from the query sequence,
but also the “neighborhood words” similar to the seed words.
The similarity here is evaluated by the substitution matrix.
For example, the similarity between the seed word “DKT”
and its neightbourhood word “DRT” will be 6+2+5=13.
Similarly, the similarity between all possible neighborhood words
and a seed word can be worked out.
Of course, we should only consider highly similar neighborhood words to reduce false positives.
In practice, in the latest version of BLAST,
a neighborhood word will only be considered if it has a similarity of 11 or more with the seed word.