The BLAST Algorithm


There are seven BLAST programs available. They were developed by Altschul, et. al. BLASTN searches for matches between a nucleic acid query and nucleotide database, but is rarely used. Matching proteins using the algorithm is far more common. BLAST is faster than FASTA and Smith-Waterman, although less sensitive and unable to consider gaps.


BLAST is a simplification of the Smith-Waterman algorithm. Like FASTA, it is fast because it looks for exact matches of short words, instead of for similar sequences that match due to gaps in either the query or target. Unlike FASTA, it scores these matches using all the values in a similarity matrix.

The BLAST word based heuristic uses a default word (w-mer where w is the length) size of three for proteins and eleven for nucleic acid sequences, but the following example (from www.psc.edu/biomed/TUTORIALS/SEQUENCE/DBSEARCH/tutorial.html uses a word size of two for the sake of simplicity.

Take the following, Adipokinetic hormone II of the migratory locust.
qlnfsagw
ql
.ln
..nf
...fs
....sa
.....ag
......gw
Thus, the sequence has been split into words of size two. A sequence of length n will result in an initial word list of n-w+1 entries. In this case, this results in 8-2+1=7.

Next, any word in the database, that scores at least a minimum threshold value (according to some scoring scheme such as PAM 120 or BLOSUM) when aligned with any of the initial list of words, is added to the list. Below is a list of 47 words derived from the initial 7. The expanded list contains any word that scores at least eight (according to the PAM 120 matrix) when aligned with one of the 7 words.

Initial Word Expanded List
ql ql, qm, hl, zl
ln ln, lb
nf nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf
fs fs, fa, fn, fd, fg, fp, ft, fb, ys
sa nothing scores 8 or higher
ag ag
gw gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw, bw, zw, xw

The third step of the algorithm, now that this list of query words has been compiled, is to search for them in the database. When an exact match is found, the algorithm calls for looking in both directions for the rest of the sequence.

As was mentioned earlier, BLAST gains speed at the cost of sensitivity: it does not consider gaps (i.e. there must not be signficant insertions or deletions). The following example (from avery.rutgers.edu/WSSP/StudentScholars/Session15/Session15.html) shows the drawback of this.

ACDEEFGH ACDEFGH

BLAST will find ACDE because it is in both sequences. When it tries to extend the seed toward the right, though, the duplicate E in the top sequence doesn't allow a high similarity score.

BLAST may identify the EFGH identity located at the right end of the two sequences and make it a seed. When looking to the left though, it will not extend it as because that would require inserting a gap in the lower sequence.


Try out BLAST at www.ncbi.nlm.nih.gov/blast/blast.cgi. Newer versions of BLAST improve on this defect, though often at the cost of speed.




Sources:
avery.rutgers.edu/WSSP/StudentScholars/Session15/Session15.html
www.psc.edu/biomed/TUTORIALS/SEQUENCE/DBSEARCH/tutorial.html. Viewed 9-17-00.
bimas.dcrt.nih.gov/blastinfo/blastref.html. Viewed 9-17-00.
Bishop, Martin (ed.). Guide to Human Genome Computing. 2nd ed. San Diego: Academic Press (1998).