Human Genome Sequencing Center, Baylor College of Medicine
 
 


Notes on Draft Data Present in GenBank


Insert length

The insert length of a clone is estimated by two methods. In the first, a high-resolution fingerprint is generated and band sizes are determined by comparison to size standards. The sum of the lengths of the identified bands gives an estimate of the total size of the clone, and the vector bands are then subtracted to arrive at the insert size. This method loses accuracy when double and triple bands, or very small and very large bands, are difficult to identify.
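The arithmetic behind the fingerprint estimate can be sketched as follows. This is only an illustration: the function name and all band sizes are hypothetical, and the sketch assumes that band sizes have already been called against the size standards and that the vector bands are known.

```python
def insert_size_from_fingerprint(band_sizes_bp, vector_band_sizes_bp):
    """Estimate insert size: sum of all fingerprint bands minus the vector bands."""
    return sum(band_sizes_bp) - sum(vector_band_sizes_bp)


# Hypothetical band sizes in base pairs; one band is assumed to be vector.
bands = [12000, 9500, 8000, 6500, 4200, 3100]
vector_bands = [6500]

estimate = insert_size_from_fingerprint(bands, vector_bands)
print(estimate)
```

A double band that is called as a single band would drop one band's length from the sum, which is one way this estimate ends up too small.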

The second method sums the sizes of all contigs larger than 1 kb found in the assembly. Because vector sequences are screened out prior to assembly, this sum gives an estimate of the insert size of the clone. Mis-assemblies and overlapping contigs within the assembly introduce error into this estimate.
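A minimal sketch of the contig-based estimate, assuming the contig lengths have already been extracted from the assembly. The function name and example lengths are hypothetical, and "larger than 1 kb" is read here as strictly greater than 1000 bases.

```python
MIN_CONTIG_BP = 1000  # "larger than 1 kb", read as strictly greater than 1000 bases


def insert_size_from_contigs(contig_lengths_bp, min_length_bp=MIN_CONTIG_BP):
    """Sum the lengths of all contigs larger than the minimum length."""
    return sum(n for n in contig_lengths_bp if n > min_length_bp)


# Hypothetical contig lengths; the 800 bp and 1000 bp contigs are excluded.
contig_lengths = [45000, 30500, 12000, 800, 1000]
contig_estimate = insert_size_from_contigs(contig_lengths)
print(contig_estimate)
```

Two contigs that overlap without having been joined would both be counted in full, so the shared region is counted twice.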

 

Coverage calculation

Coverage is calculated as follows. The phrap.seq.ace file is parsed to find contigs that are larger than 1 kb. For each read present in these contigs, the number of bases that meet the minimum quality score of phred20 is tallied. The length of each contig is determined after trimming the ends of the contig to remove bases with a quality equal to phrap 0. These trimmed lengths are then added together to provide the insert size.

Coverage = (Total phred20 bases in contigs > 1 kb) / (Sum of trimmed contig lengths)
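The calculation can be sketched as below. This is an illustration under stated assumptions, not the center's actual script: parsing of the ace file is omitted, the Contig class and all function names are hypothetical, and the 1 kb threshold is assumed to apply to the untrimmed contig length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contig:
    """Hypothetical container; a real pipeline would fill this from the ace file."""
    consensus_quals: List[int]   # phrap quality of each consensus base
    read_quals: List[List[int]]  # phred quality of each base, one list per read


def trimmed_length(quals):
    """Length after removing quality-0 bases from both ends only."""
    start, end = 0, len(quals)
    while start < end and quals[start] == 0:
        start += 1
    while end > start and quals[end - 1] == 0:
        end -= 1
    return end - start


def coverage(contigs, min_contig_bp=1000, min_phred=20):
    """Total phred20 read bases divided by the sum of trimmed contig lengths."""
    # Assumption: the size filter uses the untrimmed consensus length.
    big = [c for c in contigs if len(c.consensus_quals) > min_contig_bp]
    good_bases = sum(
        1 for c in big for read in c.read_quals for q in read if q >= min_phred
    )
    total_length = sum(trimmed_length(c.consensus_quals) for c in big)
    if total_length == 0:
        raise ValueError("no contigs larger than the minimum length")
    return good_bases / total_length


# Hypothetical data: one 1200 bp contig with four reads, one 500 bp contig (ignored).
big_contig = Contig(
    consensus_quals=[0, 0] + [40] * 1196 + [0, 0],
    read_quals=[[30] * 1000 + [10] * 200 for _ in range(4)],
)
small_contig = Contig(consensus_quals=[40] * 500, read_quals=[[30] * 500])

cov = coverage([big_contig, small_contig])
print(round(cov, 2))
```

In this example the four reads contribute 4000 phred20 bases and the trimmed contig length is 1196 bases, so the coverage is 4000 / 1196.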

Note that the sum of contig lengths will differ from the total length reported in GenBank entries. First, 100 n's are placed between contigs in the entry; these are counted in the GenBank length but not in the contig sum. Second, the trimming of phrap 0 bases reduces the size by a few kilobases per entry.
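The two sources of difference can be illustrated with a short sketch. It assumes that the GenBank entry contains the same contigs, untrimmed, joined by runs of 100 n's; the function name and all lengths are hypothetical.

```python
GAP_NS = 100  # n's placed between adjacent contigs in the entry


def genbank_length(untrimmed_contig_lengths_bp, gap_ns=GAP_NS):
    """Entry length: untrimmed contigs plus one run of n's between each pair."""
    gaps = max(len(untrimmed_contig_lengths_bp) - 1, 0)
    return sum(untrimmed_contig_lengths_bp) + gaps * gap_ns


# Hypothetical contig lengths before and after trimming phrap 0 ends.
untrimmed = [45000, 30500, 12000]
trimmed = [44200, 29900, 11400]

entry_length = genbank_length(untrimmed)
contig_sum = sum(trimmed)
difference = entry_length - contig_sum
print(entry_length, contig_sum, difference)
```

Here the difference of 2200 bases is made up of 200 n's from the two gaps and 2000 trimmed bases.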

 