Smith-Waterman Algorithm


The Smith-Waterman algorithm is a database search algorithm developed by T.F. Smith and M.S. Waterman, and based on an earlier model appropriately named Needleman and Wunsch after its original creators. The S-W Algorithm implements a technique called dynamic programming, which takes alignments of any length, at any location, in any sequence, and determines whether an optimal alignment can be found. Based on these calculations, scores or weights are assigned to each character-to-character comparison: positive for exact matches/substitutions, negative for insertions/deletions. In weight matrices, scores are added together and the highest scoring alignment is reported.

To put it simply, dynamic programming finds solutions to smaller pieces of the problem and then puts them all together to form a complete and optimal final solution to the entire problem.

It is superior to the BLAST and FASTA algorithms because it searches a larger field of possibilities, making it a more sensitive technique; however, individual pair-wise comparisons between letter slows the process down significantly.

Instead of looking at an entire sequence at once, the S-W algorithm compares multi-lengthed segments, looking for whichever segment maximizes the scoring measure. The algorithm itself is recursive in nature:

H i j = max{H i-1, j-1 + s(a i, b j ); H i-k, j - W k ; H i, j-1 - W 1 ;0}

So what are we doing?

In essence, we are converting one string (sequence of letters) into another string by performing certain operations on the individual characters that make up that string. We can insert a character or delete a character from the first string, or we can substitute a character in the first string with a character from the second string. Thus, an insertion into one string results in the simultaneous deletion from the second string. What we call an edit distance, is just the minimum number of operations we perform to convert one string into another. So the similarity of two strings is simply the value of the alignment between the two strings that maximizes the total alignment value (optimal alignment value), or the highest score given. Then how do we start figuring out the alignment? Global alignment is obtained by first inserting chosen spaces (or dashes) either into or at the ends of the two strings, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

Let's say we have the sequence GAGTCGC and CTATGCA. We first line them up and compare the edit distance. Count the number of r's and d's (which in this case is 5) and that is the edit distance, while the number of m's is the similarity score (which equals 4).

                       G - A G T C G C - 
                       r i m d m d m m i 
                       C T A - T - G C A
for edit distance scoring,
r = replacement = 1
m = match = 0
i, d = insertion/deletion = 1

for similarity scoring,
m = matches = 1 and the rest are 0

edit distance = 5
similarity = 4

two strings are compared
A = a1 a2 ... an
B = b1 b2 ... bm
similarity s(a,b)
is given by a weight matrix
first term
corresponds to extending the alignment by one residue from each sequence
second term
describes extending the alignment by including a residue from sequence (string) b and inserting a gap of k residues in length, aligned to end with a residue of sequence (string) b, into sequence a
third term
same as inserting a gap into sequence b
fourth term
0, allows for partial scores within the table to become negative

putting zero in the recursion accounts for the case that if the alignment score becomes negative, it is ignored and the preceding calculations are ignored as well, and start over from a neutral score
matrix
first fill with zeros
calculate H i j
where W k = 1 + (1/3) * k and k = gap length
choose the highest score
there are only four possible ways to forming a path:
  1. align with next residue of database sequence (score is previous score plus similarity score for the two residues
  2. deletion
  3. insertion
  4. zero

From the scoring matrix, the S-W algorithm finds the highest score and traces back the path of the previous highest scores. This in essence determines the places in the query sequence in which to place dashes (insertions or deletions). The calculated path can then report to the user the sequence in the database that best matches the query string.

the information on this page is a reference to:

http://www.maths.tcd.ie/~lily/pres2/sld009.htm

photo source:

photographer: Peter Dazeley
http://www.viewimages.com/viewimage/?imageid=243138&promotionid=1&partnerid=2\&type=results

matrix source:


http://www.bimcore.emory.edu/Kins/gmb790r/Lect2/sld013.htm