APPLICATIONS

Single Molecule Assembly

Because of the complexity of the human genome sequence, the current standard practice is to align sequencer reads to a reference human genome. This approach, called re-sequencing, reduces the capacity requirements of both the sequencer and the computer, and therefore the cost. However, re-sequencing does not capture long range information such as structural variation and phase.

Linked reads have been used for nearly two decades to improve human genome assemblies built from short read sequencing. Technologies such as mate-pair library generation, synthetic long reads, chromatin proximity ligation and micro-droplet technology have been developed for this purpose. Micro-droplet technology allows for a long distance between linked reads (>100kb) and more linked reads per long DNA molecule (10-40). However, because of sequence bias, error propagation and sequencing artifacts, current technologies are limited to use as a scaffold for a standard short read genome assembly. As a result, linked read assembly requires combining two data sets from two different libraries and two sequencing runs, at nearly three times the cost of a standard short read genome assembly.

When iGenomX master mix is used with high throughput micro-droplet technology for linked read analysis, the result is a high quality genome assembly from a single technology that provides both variant identification and linked read information.



Figure 1: Linked reads workflow.




Figure 2: Reads from the same barcode are “linked” to the same long DNA molecule.




Figure 3: Example of a 13kb molecule showing short read coverage. Reads are sorted by their 5’ start position. Each short read is drawn as a red hash mark on its own line along the vertical axis. Longer molecules do not display well in the UCSC browser.




Figure 4: Cumulative fraction of all molecule lengths assembled from NA12878. The average length of assembled molecules is 83kb. More than 90% of all assembled molecules are longer than 20kb.
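The two summary statistics quoted in the caption (a mean length and the share of molecules above a length cutoff) can be computed from a plain list of molecule lengths. The sketch below is illustrative only: the function names and the toy lengths are assumptions, not iGenomX data or code.

```python
from statistics import mean


def cumulative_fraction(lengths):
    """Pair each length with its rank fraction in ascending order.

    The i-th shortest molecule (1-based) is paired with i / total, so the
    last pair always has fraction 1.0.
    """
    ordered = sorted(lengths)
    total = len(ordered)
    return [(n, (i + 1) / total) for i, n in enumerate(ordered)]


def fraction_longer_than(lengths, threshold):
    """Fraction of molecules whose length strictly exceeds `threshold`."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n > threshold) / len(lengths)


# Toy lengths in base pairs (hypothetical, for illustration only).
toy_lengths = [15_000, 25_000, 60_000, 90_000, 120_000]
print(mean(toy_lengths))                          # mean molecule length
print(fraction_longer_than(toy_lengths, 20_000))  # share longer than 20kb
print(cumulative_fraction(toy_lengths))           # points for a cumulative plot
```

Plotting the pairs returned by `cumulative_fraction` with length on the horizontal axis gives a curve of the same kind as the figure.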



Table 1: Variant calling of NA12878, following GATK Best Practices.



Table 2: Phasing statistics on NA12878.



Table 3: Variant calling and phase statistics on additional samples.





Figure 5: iGenomX SMA analysis pipeline.




Figure 6: Short reads are aligned to genome.




Figure 7: Short reads are grouped by barcode.
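As a rough illustration of the grouping step, the sketch below buckets aligned reads by barcode. It assumes each read has already been reduced to a hypothetical tuple of (barcode, chromosome, start, end); how the barcode is actually stored in the pipeline's alignment files is not described here.

```python
from collections import defaultdict


def group_by_barcode(reads):
    """Group aligned reads by barcode.

    `reads` is an iterable of (barcode, chrom, start, end) tuples; this
    layout is an assumption made for the example. Returns a dict mapping
    each barcode to its list of (chrom, start, end) alignments.
    """
    groups = defaultdict(list)
    for barcode, chrom, start, end in reads:
        groups[barcode].append((chrom, start, end))
    return dict(groups)


# Toy reads (hypothetical barcodes and coordinates).
toy_reads = [
    ("ACGT", "chr1", 1_000, 1_150),
    ("TTAG", "chr2", 5_000, 5_150),
    ("ACGT", "chr1", 9_000, 9_150),
]
for barcode, alignments in group_by_barcode(toy_reads).items():
    print(barcode, alignments)
```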




Figure 8: Long molecules are assembled.
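One common way to reconstruct long molecules from reads that share a barcode is to sort the reads by position and start a new molecule whenever the gap to the previous read exceeds a cutoff. The sketch below shows that idea only; it is not necessarily the method used in the iGenomX pipeline, and the `max_gap` parameter and the toy coordinates are assumptions.

```python
def assemble_molecules(reads, max_gap):
    """Merge the reads of ONE barcode into inferred long molecules.

    `reads` is an iterable of (chrom, start, end) tuples. Reads on the same
    chromosome are chained together while the distance from the current
    molecule's end to the next read's start is at most `max_gap`.
    Returns a list of (chrom, start, end, read_count) tuples.
    """
    molecules = []
    for chrom, start, end in sorted(reads):
        if molecules:
            m_chrom, m_start, m_end, count = molecules[-1]
            if chrom == m_chrom and start - m_end <= max_gap:
                # Extend the current molecule to cover this read.
                molecules[-1] = (m_chrom, m_start, max(m_end, end), count + 1)
                continue
        # Different chromosome or gap too large: start a new molecule.
        molecules.append((chrom, start, end, 1))
    return molecules


# Toy reads for a single barcode (hypothetical coordinates).
toy_reads = [
    ("chr1", 9_000, 9_150),
    ("chr1", 1_000, 1_150),
    ("chr1", 500_000, 500_150),
    ("chr2", 100, 250),
]
print(assemble_molecules(toy_reads, max_gap=50_000))
```

The cutoff trades off two errors: too small a value splits one molecule into several, while too large a value merges distinct molecules that happen to share a barcode.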




Figure 9: Variants are called.




Figure 10: Haplotypes are identified.




Figure 11: Only heterozygous variants show haplotype differences.
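Homozygous sites carry the same allele on both haplotypes, so only heterozygous sites can tell the two haplotypes apart. The sketch below filters genotypes down to heterozygous sites and then assigns a molecule to a haplotype by majority vote over the alleles it covers. It is a simplified illustration, not the pipeline's phasing algorithm: the function names are hypothetical, and the genotype tuples are assumed to be already phased as (haplotype 1 allele, haplotype 2 allele).

```python
def heterozygous_sites(genotypes):
    """Keep only the sites whose two alleles differ.

    `genotypes` maps position -> (allele_hap1, allele_hap2).
    """
    return {pos: gt for pos, gt in genotypes.items() if gt[0] != gt[1]}


def assign_haplotype(molecule_alleles, phased_sites):
    """Assign a molecule to haplotype 1 or 2 by majority vote.

    `molecule_alleles` maps position -> allele observed on the molecule.
    `phased_sites` maps position -> (allele_hap1, allele_hap2).
    Returns 1 or 2, or None when the votes tie (including no informative site).
    """
    votes = {1: 0, 2: 0}
    for pos, allele in molecule_alleles.items():
        if pos not in phased_sites:
            continue  # site is not heterozygous, so it is uninformative
        hap1, hap2 = phased_sites[pos]
        if allele == hap1:
            votes[1] += 1
        elif allele == hap2:
            votes[2] += 1
    if votes[1] == votes[2]:
        return None
    return 1 if votes[1] > votes[2] else 2


# Toy genotypes (hypothetical positions and alleles).
genotypes = {100: ("A", "G"), 200: ("C", "C"), 300: ("T", "C")}
het = heterozygous_sites(genotypes)
print(sorted(het))                                           # informative positions
print(assign_haplotype({100: "A", 200: "C", 300: "T"}, het))  # haplotype call
```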




Figure 12: IGV view of the HLA locus, zoomed in to 5kb. The coverage track shows reference bases (grey) and non-reference bases (red, green, blue and yellow). The molecule track shows sequenced bases (color) and gaps (grey). NA12878 has two distinct haplotypes in exons 2 and 3 of the HLA gene (vertical black boxes).