The FastA Algorithm

The FastA algorithm allows for the comparison of a query sequence to a DNA sequence database. The algorithm uses a fast search to initially identify sequences from the database with a high degree of similarity to the query sequence. Then it conducts a second comparison on the selected sequences.

While FastA is actually just a fast approximation to the Smith-Waterman algorithm, it is slower and more sensitive than the BLAST algorithm because FastA tolerates gaps in the aligned sequences.


In the first step of this search, the comparison can be viewed as a set of many dot plots with the query sequence on the vertical axis and each sequence in the database on the horizontal axis of its particular plot.

Set a word size. (A word is a short sequence of nucleotides. A word of size 2 could be, for instance, gg.) Place a dot wherever words of this size match.

For example, if the query sequence were ggctttcgg and the database sequence were aacggcttacg, you would get the following plot:

The two diagonal series of dots in the figure indicate that the two sequences are identical over these diagonals. The purpose of this first step is to find the longest diagonals, or highest scoring regions.

In the second step, rescore the 10 best diagonals using a scoring matrix that allows and takes into account conservative replacements and ambiguity symbols shorter than the size of a word. This analysis finds high scoring subregions within the diagonals, called "initial regions."

In step three, take the initial regions whose scores are above a predetermined threshold and check to see if they can be joined together. Impose a penalty on joined regions so that joined pieces have lower scores than continuous runs.

Finally, use a variant of the last part of the Smith-Waterman algorithm to align the sequence and calculate the optimal score. If the optimal score is above a certain threshold, place the sequence in the match list.