The PHRAP Algorithm
PHRAP, or "phragment assembly program" assembles shotgun DNA sequence data.
Outline of PHRAP assembly quoted from www.genome.ou.edu/phrap.html:
- Read in sequence & quality data, trim off any near-homopolymer runs at ends of reads, construct read complements.
- Find pairs of reads with matching words. Eliminate exact duplicate reads. Do swat comparisons of pairs of reads which have matching words, compute (complexity-adjusted) swat score.
- Find probable vector matches and mark so they aren't used in assembly.
- Find near duplicate reads.
- Find reads with self-matches.
- Find matching read pairs that are "node-rejected" i.e. do not have "solid" matching segments.
- Use pairwise matches to identify confirmed parts of reads; use these to compute revised quality values.
- Compute LLR scores for each match (based on qualities of discrepant and matching bases).
(Iterate above two steps).
- Find best alignment for each matching pair of reads that have more than one significant alignment in a given region (highest LLR-scores among several overlapping).
- Identify probable chimeric and deletion reads (the latter are withheld from assembly).
- Construct contig layouts, using consistent pairwise matches in decreasing score order (greedy algorithm). Consistency of layout is checked at pairwise comparison level.
- Construct contig sequence as a mosaic of the highest quality parts of the reads.
- Align reads to contig; tabulate inconsistencies (read / contig discrepancies) & possible sites of misassembly. Adjust LLR-scores of contig sequence.