Notes on Draft Data Present in GenBank
Insert length
The insert length of a clone is estimated by two methods. First, a
high-resolution fingerprint is generated and band sizes are determined
by comparison to size standards. The sum of the lengths of the bands
identified provides an estimate of the total size of the clone. Vector
bands are then subtracted to arrive at the insert size. Difficulties in
identifying double and triple bands, as well as very small and very
large bands, reduce the accuracy of this method.
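The arithmetic of the fingerprint method can be sketched as follows.
This is a minimal illustration, not the production pipeline: the
function name and all band sizes are hypothetical, and band calling
from the gel image is assumed to have been done already.

```python
def insert_size_from_fingerprint(band_sizes_bp, vector_band_sizes_bp):
    """Estimate insert size as the sum of all bands minus the vector bands."""
    return sum(band_sizes_bp) - sum(vector_band_sizes_bp)


# Hypothetical band sizes in base pairs, for illustration only.
# The two 9500 bp entries represent a correctly identified double band.
bands = [12000, 9500, 9500, 6400, 4100, 2300]
vector_bands = [6400]

estimate = insert_size_from_fingerprint(bands, vector_bands)
print(estimate)
```

With these made-up values the estimate is 37,400 bp. If the double band
were read as a single band, the estimate would fall by 9,500 bp, which
is the kind of error described above.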
The second method of determining insert size is to add up the sizes of
all the contigs larger than 1 kb found in the assembly. Vector sequences
are screened prior to assembly, so this sum gives an estimate of the
insert size of the clone. Mis-assemblies and overlapping contigs within
the assembly can make this sum differ from the true insert size.
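The contig-sum method amounts to a filtered sum. The sketch below
assumes the contig lengths have already been extracted from the
assembly; the function name and the example lengths are hypothetical.

```python
MIN_CONTIG_LENGTH = 1000  # the 1 kb threshold stated in the text


def insert_size_from_contigs(contig_lengths_bp, min_length=MIN_CONTIG_LENGTH):
    """Sum the lengths of contigs strictly larger than min_length."""
    return sum(length for length in contig_lengths_bp if length > min_length)


# Hypothetical contig lengths in base pairs; 800 and 1000 are excluded
# because neither is larger than 1 kb.
print(insert_size_from_contigs([45000, 30000, 800, 1000, 1500]))
```

Overlapping contigs would be counted twice by this sum, which is one
reason the result can overestimate the true insert size.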
Coverage calculation
Coverage is calculated as follows. The phrap.seq.ace file is parsed to
find contigs that are larger than 1 kb. For each read present in these
contigs, the number of bases that meet the minimum quality score of
phred20 is tallied. The length of each contig is determined after
trimming the ends of the contig to remove bases with a phrap quality
of 0. These trimmed lengths are then added together to provide the
insert size.
Coverage = (Total Phred20 Bases in contigs > 1kb) / (Sum of contig lengths)
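A minimal sketch of this calculation is given below. It does not parse
the ace file; it assumes the per-base qualities are already in memory.
The Contig class and its field names are hypothetical, and applying the
1 kb filter to the untrimmed contig length is an assumption, since the
text does not say whether the filter comes before or after trimming.

```python
from dataclasses import dataclass
from typing import List

PHRED_THRESHOLD = 20      # minimum read base quality that is counted
MIN_CONTIG_LENGTH = 1000  # contigs must be larger than 1 kb


@dataclass
class Contig:
    consensus_quals: List[int]   # phrap quality for each consensus base
    read_quals: List[List[int]]  # phred quality for each base of each read


def trimmed_length(consensus_quals):
    """Contig length after removing quality-0 bases from both ends."""
    start, end = 0, len(consensus_quals)
    while start < end and consensus_quals[start] == 0:
        start += 1
    while end > start and consensus_quals[end - 1] == 0:
        end -= 1
    return end - start


def coverage(contigs):
    """Total phred20 read bases divided by the sum of trimmed contig lengths."""
    total_q20 = 0
    total_length = 0
    for contig in contigs:
        # Assumption: the 1 kb filter is applied to the untrimmed length.
        if len(contig.consensus_quals) <= MIN_CONTIG_LENGTH:
            continue
        total_q20 += sum(
            1 for read in contig.read_quals for q in read if q >= PHRED_THRESHOLD
        )
        total_length += trimmed_length(contig.consensus_quals)
    if total_length == 0:
        raise ValueError("no contigs larger than 1 kb")
    return total_q20 / total_length


# Hypothetical data: one 1510 bp contig (1500 bp after trimming) with
# three reads of 1000 phred20 bases each, and one short contig that is skipped.
big = Contig([0] * 5 + [40] * 1500 + [0] * 5, [[30] * 1000 + [10] * 100] * 3)
small = Contig([40] * 500, [[30] * 500])
print(coverage([big, small]))
```

Here 3,000 phred20 bases over 1,500 bp of trimmed contig gives a
coverage of 2.0.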
Note that the sum of contig lengths will differ from the total length
reported in GenBank entries. First, 100 n's are placed between contigs
in the entry; these are counted in GenBank but not in the contig sum.
Second, the trimming of phrap quality 0 bases reduces the contig sum by
a few kilobases per entry.
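The size of this difference can be approximated as below. This is only
a rough sketch: it assumes that n contigs in an entry are separated by
n - 1 spacers and that the entry contains the same contigs as the
contig sum. The function names and the example numbers are hypothetical.

```python
SPACER_LENGTH = 100  # n's placed between adjacent contigs, per the text


def spacer_bases(num_contigs, spacer_length=SPACER_LENGTH):
    """Total n's between contigs, assuming one spacer per adjacent pair."""
    return max(num_contigs - 1, 0) * spacer_length


def expected_difference(num_contigs, trimmed_bases):
    """Approximate GenBank length minus the trimmed contig sum."""
    return spacer_bases(num_contigs) + trimmed_bases


# Hypothetical entry: 12 contigs with 3000 bases trimmed in total.
print(expected_difference(12, 3000))
```

For this made-up entry, the GenBank length would exceed the contig sum
by 4,100 bases: 1,100 from spacers and 3,000 from trimming.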