The Smith-Waterman algorithm is a database search algorithm developed by T.F. Smith and M.S. Waterman, and based on an earlier model appropriately named Needleman and Wunsch after its original creators. The S-W Algorithm implements a technique called dynamic programming, which takes alignments of any length, at any location, in any sequence, and determines whether an optimal alignment can be found. Based on these calculations, scores or weights are assigned to each character-to-character comparison: positive for exact matches/substitutions, negative for insertions/deletions. In weight matrices, scores are added together and the highest scoring alignment is reported.
To put it simply, dynamic programming finds solutions to smaller pieces of the problem and then puts them all together to form a complete and optimal final solution to the entire problem.It is superior to the BLAST and FASTA algorithms because it searches a larger field of possibilities, making it a more sensitive technique; however, individual pair-wise comparisons between letter slows the process down significantly.
Instead of looking at an entire sequence at once, the S-W algorithm compares multi-lengthed segments, looking for whichever segment maximizes the scoring measure. The algorithm itself is recursive in nature:
H i j = max{H i-1, j-1 + s(a i, b j ); H i-k, j - W k ; H i, j-1 - W 1 ;0}
So what are we doing?
In essence, we are converting one string (sequence of letters) into another string by performing certain operations on the individual characters that make up that string. We can insert a character or delete a character from the first string, or we can substitute a character in the first string with a character from the second string. Thus, an insertion into one string results in the simultaneous deletion from the second string. What we call an edit distance, is just the minimum number of operations we perform to convert one string into another.
So the similarity of two strings is simply the value of the alignment between the two strings that maximizes the total alignment value (optimal alignment value), or the highest score given. Then how do we start figuring out the alignment?
Global alignment is obtained by first inserting chosen spaces (or dashes) either into or at the ends of the two strings, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.
Let's say we have the sequence GAGTCGC and CTATGCA. We first line them up and compare the edit distance. Count the number of r's and d's (which in this case is 5) and that is the edit distance, while the number of m's is the similarity score (which equals 4).
G - A G T C G C - r i m d m d m m i C T A - T - G C Afor edit distance scoring,
From the scoring matrix, the S-W algorithm finds the highest score and traces back the path of the previous highest scores. This in essence determines the places in the query sequence in which to place dashes (insertions or deletions). The calculated path can then report to the user the sequence in the database that best matches the query string.
the information on this page is a reference to:
http://www.maths.tcd.ie/~lily/pres2/sld009.htm
photo source:
photographer: Peter Dazeley
http://www.viewimages.com/viewimage/?imageid=243138&promotionid=1&partnerid=2\&type=results
matrix source:
http://www.bimcore.emory.edu/Kins/gmb790r/Lect2/sld013.htm