NEXT GENERATION GENOTYPING (NGG) FOR POPULATION SCALE GENOMIC STUDIES USING RIPTIDE™ DNA LIBRARY PREPARATION
Keith Brown,3 Azeem Siddique,1,3 Gaia Suckow,1,3 Nils Homer,2 Jay Carey,2 Phillip Ordoukhanian,1,3 Steve Head,1,3 Joseph Pickrell,5 Ryan Kim5
1The Scripps Research Institute, La Jolla, CA, USA; 2Fulcrum Genomics, Somerville, MA, USA; 3iGenomX, Carlsbad, CA; 4Macrogen, Rockville, MD, USA; 5Gencove, New York, NY, USA
Over the past two decades, large population genotyping studies have delivered a better understanding of disease, an expanded pipeline of therapeutic targets, and new diagnostic and predictive tests to improve health and wellness. These advances are a result of innovation in genotyping technology. From real-time PCR and TaqMan assays through microarrays and sequencing, the drive for more information at a reduced cost has been consistent. Today’s most comprehensive microarrays can provide over 5 million single nucleotide variants (SNVs) at a cost of approximately $0.0001 per genotype ($450/sample). We present a Next-Generation Genotyping (NGG) approach and a novel DNA library prep product that enable sequence-based genotyping of more than 37 million genetic markers at a cost of less than $0.000002/genotype (less than $80/sample). In addition to the dramatic increase in marker density throughout the genome, NGG avoids the ascertainment bias associated with static-content microarrays.
NGG is enabled by high-throughput sequencers; the main requirements for this application are low error rates and high read counts. As the widespread adoption of NGS platforms for applications like NIPT has shown, achieving high quality results at low cost is a matter of obtaining just enough evidence (defined as number of reads) to make high confidence calls on known targets. Sequencer reads are mapped to a reference genome, a haplotype is identified, and genotypes are assigned. More than twenty years of sequencing and genotyping have made this approach robust. In this study, we evaluate the minimum performance criteria required for genotyping with high precision and sensitivity, using the iGenomX RipTide high throughput DNA library preparation product.
Comparison of Genotyping Products
• >10x more variants than microarrays
• No fixed content bias
• Improved signals for GWAS and PRS
Genomic DNA samples (960) from the 1000 Genomes Project (1KG) were obtained from the Coriell Institute (New Jersey). For each sample, 50 ng of input DNA was prepared using RIPTIDE, and all samples were sequenced on a single Illumina NovaSeq S4 flow cell by Macrogen (Maryland). FASTQ files were sent to Gencove (New York) for genotype calling and VCF generation. VCFs were compared to public genotypes (Illumina GSA microarray) at NIST high confidence loci to analyze precision, sensitivity and accuracy across platforms. Processing time is estimated at less than 5 hours for library prep (with automation), 40 hours for sequencing, and 24 hours for genotyping and VCF generation.
Of the 38 million Gencove variant calls per sample, approximately 150 thousand are present on the Illumina GSA array and fall within the NIST (GIAB) high confidence regions of the human genome. These variants were assessed for precision (TP/(TP+FP)), sensitivity (TP/(TP+FN)) and accuracy (F-measure).
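As a minimal sketch of how these three metrics relate, the function below computes them from true-positive, false-positive and false-negative counts. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and sensitivity), which the text does not state explicitly. The function name and the example counts are hypothetical and are not results from this study.

```python
def concordance_metrics(tp, fp, fn):
    """Return (precision, sensitivity, f_measure) from call counts.

    tp: calls that agree with the reference genotype
    fp: variant calls not supported by the reference genotype
    fn: reference variants that were missed
    Assumption: f_measure is the balanced F1 score.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + sensitivity
    f_measure = 2 * precision * sensitivity / total if total else 0.0
    return precision, sensitivity, f_measure


# Hypothetical counts, for illustration only.
precision, sensitivity, f_measure = concordance_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f} F={f_measure:.3f}")
```

Because the F-measure is a harmonic mean, it is pulled toward the lower of the two inputs, so a sample cannot score well on it by being strong in precision alone.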
The first 960-sample set generated about 2.8 Tbp of data. Per-sample read counts averaged 20 million across the 960 samples, ranging from a few million to over 100 million. More than 90% of the samples had a read count no lower than one standard deviation below the mean. The Gencove pipeline can make calls on 38 million bi-allelic variants for each sample regardless of read count or coverage. Comparing Gencove VCF calls for these samples with GSA array genotypes from public sources allows us to determine how read count and coverage relate to precision, sensitivity and accuracy (see tables below).
Of the 934 unique samples, 840 had GSA array data available through the 1KG project (phase 3 release). Published specifications for the NovaSeq S4 flow cell show that the sequencer will generate 16-20 billion reads, meaning that 384-480 individually barcoded samples per flow cell will generate between 33 million and 52 million reads per sample on average with optimal performance. Using a published price for an S4 flow cell from Texas A&M University of approximately $25,000, the sequencing cost per sample amounts to $52-$65 (384-480 samples per flow cell).
Shown below are a list of the samples used in this study and a PCA plot of the samples, generated from our genotyping data, to demonstrate ancestry (PCA data provided by Cincinnati Children's Hospital Medical Center).
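The per-sample figures above follow from simple division, sketched below using only the numbers quoted in the text (16-20 billion reads, 384-480 samples, approximately $25,000 per flow cell). The function and constant names are illustrative. The sketch assumes reads are split evenly across samples, which the observed range of read counts shows is an idealization.

```python
# Figures quoted in the text for one NovaSeq S4 flow cell.
FLOW_CELL_READS = (16e9, 20e9)      # low and high total read yield
SAMPLES_PER_FLOW_CELL = (384, 480)  # low and high multiplexing level
FLOW_CELL_PRICE_USD = 25_000        # approximate published price


def per_sample_ranges(reads_range, samples_range, price_usd):
    """Return best- and worst-case reads and sequencing cost per sample.

    Assumption: reads are distributed evenly across barcoded samples.
    """
    reads_lo, reads_hi = reads_range
    n_lo, n_hi = samples_range
    return {
        "reads_min": reads_lo / n_hi,  # lowest yield, most samples
        "reads_max": reads_hi / n_lo,  # highest yield, fewest samples
        "cost_min": price_usd / n_hi,  # most samples share the flow cell
        "cost_max": price_usd / n_lo,  # fewest samples share the flow cell
    }


ranges = per_sample_ranges(FLOW_CELL_READS, SAMPLES_PER_FLOW_CELL, FLOW_CELL_PRICE_USD)
print(f"reads per sample: {ranges['reads_min'] / 1e6:.0f}M to {ranges['reads_max'] / 1e6:.0f}M")
print(f"cost per sample: ${ranges['cost_min']:.0f} to ${ranges['cost_max']:.0f}")
```

Multiplexing more samples lowers the cost per sample but also lowers the reads per sample, so the plex level trades cost against the evidence available for each genotype call.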
• Simple, high throughput workflow
• <$80 per sample (prep, sequencing, and analysis)
• >37M bi-allelic variants (SNV, indel, CNV) with high precision, sensitivity and accuracy
• One NovaSeq sequencer can process approximately 2,880 samples per week
• One application for all genomes (human or other)