The PHRAP Algorithm

PHRAP, or "phragment assembly program", assembles shotgun DNA sequence data.

Outline of PHRAP assembly quoted from
  1. Read in sequence & quality data, trim off any near-homopolymer runs at ends of reads, construct read complements.
  2. Find pairs of reads with matching words. Eliminate exact duplicate reads. Do swat comparisons of pairs of reads which have matching words, compute (complexity-adjusted) swat score.
  3. Find probable vector matches and mark so they aren't used in assembly.
  4. Find near duplicate reads.
  5. Find reads with self-matches.
  6. Find matching read pairs that are "node-rejected" i.e. do not have "solid" matching segments.
  7. Use pairwise matches to identify confirmed parts of reads; use these to compute revised quality values.
  8. Compute LLR scores for each match (based on qualities of discrepant and matching bases).
    (Iterate above two steps).
  9. Find best alignment for each matching pair of reads that have more than one significant alignment in a given region (highest LLR-scores among several overlapping).
  10. Identify probable chimeric and deletion reads (the latter are withheld from assembly).
  11. Construct contig layouts, using consistent pairwise matches in decreasing score order (greedy algorithm). Consistency of layout is checked at pairwise comparison level.
  12. Construct contig sequence as a mosaic of the highest quality parts of the reads.
  13. Align reads to contig; tabulate inconsistencies (read / contig discrepancies) & possible sites of misassembly. Adjust LLR-scores of contig sequence.
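
The word-matching part of step 2 can be illustrated with a small sketch. This is not PHRAP's own code: the function names `reverse_complement` and `candidate_pairs` and the `word_size` parameter are hypothetical, chosen only for illustration. The sketch assumes uppercase A/C/G/T reads and only nominates candidate pairs; the swat comparison and complexity-adjusted scoring that follow in step 2 are not shown.

```python
from collections import defaultdict
from itertools import combinations

# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_pairs(reads, word_size=12):
    """Return the set of read-name pairs that share at least one word.

    reads: dict mapping read name -> sequence (hypothetical input format).
    Every read is indexed on both strands, so a pair is reported whether
    the shared word lies on the same strand or on opposite strands.
    """
    index = defaultdict(set)  # word -> names of reads containing it
    for name, seq in reads.items():
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - word_size + 1):
                index[strand[i:i + word_size]].add(name)

    pairs = set()
    for names in index.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs
```

Only the pairs returned by such an index would need a full alignment, which is the point of matching words first: most read pairs in a shotgun data set share no word and can be skipped.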
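
The greedy layout of step 11 can be sketched in the same spirit. Again, this is an illustration under stated assumptions rather than PHRAP's implementation: the function `greedy_layout`, the `(score, read_a, read_b, offset)` tuple format, and the `tolerance` parameter are all hypothetical, and read orientation is ignored. Matches are taken in decreasing score order; a match joining two contigs merges them, and a match between reads already in the same contig is rejected if the offset it implies disagrees with the existing layout.

```python
def greedy_layout(matches, tolerance=0):
    """Build contig layouts greedily from pairwise matches.

    matches: iterable of (score, read_a, read_b, offset), where offset is
    the start of read_b minus the start of read_a implied by the alignment
    (hypothetical format; orientation is ignored in this sketch).
    Returns (contigs, rejected): contigs is a list of lists of
    (position, read) sorted by position, with the leftmost read at 0;
    rejected holds the matches inconsistent with the layout built so far.
    """
    contig_of = {}  # read -> id of the contig it belongs to
    position = {}   # read -> start coordinate within its contig
    members = {}    # contig id -> list of reads in that contig
    rejected = []

    def ensure(read):
        # A read seen for the first time starts as its own contig.
        if read not in contig_of:
            contig_of[read] = read
            position[read] = 0
            members[read] = [read]

    # Highest-scoring matches are placed first (the greedy choice).
    for match in sorted(matches, key=lambda m: m[0], reverse=True):
        score, a, b, offset = match
        ensure(a)
        ensure(b)
        ca, cb = contig_of[a], contig_of[b]
        if ca == cb:
            # Both reads are already laid out: check pairwise consistency.
            if abs((position[b] - position[a]) - offset) > tolerance:
                rejected.append(match)
            continue
        # Shift every read of b's contig so that b lands at a + offset.
        shift = position[a] + offset - position[b]
        for read in members[cb]:
            position[read] += shift
            contig_of[read] = ca
        members[ca].extend(members.pop(cb))

    contigs = []
    for reads in members.values():
        leftmost = min(position[r] for r in reads)
        contigs.append(sorted((position[r] - leftmost, r) for r in reads))
    return contigs, rejected
```

For example, given matches r1-r2 (score 50, offset 100), r2-r3 (score 40, offset 80), and r1-r3 (score 10, offset 20), the first two place r3 at 180 relative to r1, so the lowest-scoring match is rejected as inconsistent rather than distorting the layout.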