Sequence Reconstruction Algorithm


In the shotgunning approach to sequencing, small fragments of DNA must be reassembled by the computer back into the complete sequence. Essentially, it is the shortest common superstring problem: we have fragments and we want to find the shortest sequence containing all the fragments. Consider the example below based on a similar one in Waterman's book. It does not involve any fragment reversals or errors in reading.

f1=GCGC
f2=CGCC
f3=GGCG

The shortest common superstring is:
.GGCGCC
f1=.GCGC
f2=..CGCC
f3=GGCG

The shortest superstring problem can be displayed as a Hamiltonian path and shown to be equivalent to the travelling salesman problem. It is NP-complete. A greedy algorithm exists which sequentially merges fragments starting with the pair with the most overlap first.

Greedy algorithm for shortest superstring

Let T be the set of all fragments and let S be an empty set.
While T is not empty, do
{
    For the pair s,t in T with maximum overlap (s=t is allowed)
    {
        If s is different from t, merge s and t.
        If s = t, remove s from T and add s to S.	
    }
}
Output the concatenation of the elements of S.


This is not a quick algorithm even though it ignores real-world problems such as which direction a fragment should go, errors in data, insertions, and deletions.




Sources:
Waterman, Michael. Introduction to Computational Biology. London: Chapman & Hall (1995).