README for Genome Sequence of Macropus eugenii, Meug_1.1
(September 15, 2009)
0. Conditions for use
1. What's New
2. Introduction
3. Description of files
4. Sequence and Scaffold statistics
5. Read statistics
6. History
0. Conditions for Use.
The data may be freely downloaded, used in analyses, and repackaged in databases. Some of
the data presented here represents work in progress. It is being released by the Baylor
College of Medicine Human Genome Sequencing Center (BCM-HGSC) prior to project completion
as a public service to allow our colleagues to search for genes or functions and speed
their research. These data have not been edited and are presented "as is." You should
regard the data as preliminary if it is unpublished. The data providers and associated
funding agencies bear no responsibility for the user's reliance upon or interpretation
of these data. The accuracy or reliability of the data is not guaranteed or warranted
in any way and the providers disclaim liability of any kind. If you use this preliminary
information we request that you honor the following conditions:
1. Please communicate your results to us so that we can incorporate them into
the annotation of the final sequence. Contact us at hgsc-help@hgsc.bcm.tmc.edu.
2. Acknowledge the information obtained from BCM-HGSC in publications by stating
in Materials and Methods and Acknowledgements: "Preliminary sequence data was obtained
from Baylor College of Medicine Human Genome Sequencing Center website at
http://www.hgsc.bcm.tmc.edu." Also acknowledge our funding source, which is listed in each
project, with a statement such as "The DNA sequence of [organism] was supported by
[grant number from funding agency to PI] at the BCM-HGSC." We also request that you notify
us when your manuscript is accepted and send us a pre-print of the article.
3. Use of this data or information derived from it on a web page is permitted,
providing the web page contains the statement that "Preliminary sequence data was obtained
from the Baylor College of Medicine Human Genome Sequencing Center website at
http://www.hgsc.bcm.tmc.edu." Please inform us of your web page by sending email to
hgsc-help@hgsc.bcm.tmc.edu.
4. All other written or oral public disclosures of research using data from the
BCM-HGSC should follow the acknowledgment guidelines outlined above.
5. However, although we encourage use of this preliminary information for
limited studies, we request that you not publish whole genome or chromosome scale
analyses of genes or genomic data prior to the publication of the BCM-HGSC report on the
final genome sequence and analysis. Contact the BCM-HGSC at hgsc-help@hgsc.bcm.tmc.edu
to discuss a waiver of this request, which could involve simple acknowledgment,
co-authorship, or other methods.
6. Any redistribution of the data should carry this notice.
1. What's New
This is the second release (Meug_1.1) of the draft genome assembly of the Tammar wallaby,
Macropus eugenii. After the release of the first version (Meug_1.0), Macropus eugenii was
further sequenced using the ABI SOLiD technology to a sequence coverage of 5.9x. This
assembly upgrades Meug_1.0: the paired-end reads from SOLiD were used for
superscaffolding, which merged a large number of Meug_1.0 scaffolds into new
scaffolds, and a small number of Meug_1.0 contigs were removed due to redundancy.
The assembly statistics below reflect these changes.
2. Introduction
This information is for the second release (Meug_1.1) of the draft genome sequence of the
wallaby, Macropus eugenii. This is a draft sequence and may contain errors, so users
should exercise caution. Typical errors in draft genome sequences include misassemblies
of repeated sequences, collapses of repeated regions, and unmerged overlaps (e.g. due to
polymorphisms) creating artificial duplications. However, base accuracy in contigs
(contiguous blocks of sequence) is usually very high, with most errors near the
ends of contigs.
The release was produced by superscaffolding the previous assembly, Meug_1.0, using
paired-end ABI SOLiD reads 25 bp in length from small-insert clones (average insert
size ~1.4 kb). Of the ~350 million clones in total, 53 million mapped uniquely at both
the F3 and R3 ends. Two thirds of these uniquely mapping pairs (36 million) mapped
within a single scaffold and were used to evaluate the quality of the Meug_1.0 assembly:
clones with one end in the wrong orientation, or with an inferred insert size that was
too large, suggest possible misassemblies (see section 5 for details). The remaining
third (17 million) mapped to two different scaffolds and were used to superscaffold
the Meug_1.0 assembly.
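The sorting of uniquely mapped pairs described above can be sketched as follows. This is
an illustration only, not the pipeline used for this release: the MappedEnd record, the
category names, and the expected_same_strand switch are hypothetical, and the orientation
convention of the library is an assumption. The 5 kb cutoff and the 25 bp read length are
taken from this README.

```python
from dataclasses import dataclass

READ_LENGTH = 25    # bp, SOLiD read length stated in this README
MAX_INSERT = 5000   # bp, cutoff from footnote [5] in section 5


@dataclass
class MappedEnd:
    """One uniquely mapped read end (hypothetical record layout)."""
    scaffold: str
    position: int   # leftmost mapped coordinate on the scaffold
    strand: str     # "+" or "-"


def classify_pair(f3, r3, expected_same_strand=True, max_insert=MAX_INSERT):
    """Assign a uniquely mapped F3/R3 pair to one of four categories."""
    # Ends on different scaffolds: candidate link for superscaffolding.
    if f3.scaffold != r3.scaffold:
        return "bridge"
    # Same scaffold, unexpected relative orientation: possible misassembly.
    # Which orientation is "expected" is an assumption set by the caller.
    same_strand = f3.strand == r3.strand
    if same_strand != expected_same_strand:
        return "misoriented"
    # Same scaffold, implied insert too large: possible misassembly.
    inferred_insert = abs(f3.position - r3.position) + READ_LENGTH
    if inferred_insert > max_insert:
        return "oversized"
    return "consistent"
```

Pairs labelled "bridge" correspond to footnote [2] in section 5, and the other three
labels together correspond to footnote [3].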
The original contigs from the Meug_1.0 assembly were used in this newly scaffolded
assembly, and their accession numbers did not change. However, not all of the
original contigs are found in the Meug_1.1 genome assembly.
3. Description of files
The files can be found on the HGSC ftp site:
ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Meugenii/fasta/
which is linked to the HGSC web site at www.hgsc.bcm.tmc.edu.
There are 4 directories containing files related to this release.
I. Meug20071125-freeze/contigs
Because the contigs did not change from the Meug_1.0 assembly, the files
for the assembled contigs in the genome remain in this directory. There
is no chromosome assignment for the contigs in Meug_1.1.
ABQO01_accs.gz (conversion table of contig names to accession numbers)
Meug20071125-contigs.fa.gz (fasta file)
Meug20071125-contigs.fa.qual.gz (quality file)
Meug20071125-assembly.ace.gz (ACE file)
The files are gzipped. The directory contains the fasta formatted
sequence file of the contigs and the corresponding quality file. The
AGP file in this directory (Meug20071125-contigs.agp.gz) describes
the positions and orientations of the contigs in the previous,
Meug_1.0 assembly.
The Meug20071125-assembly.ace.gz file is the concatenation of all
the Phrap-generated .ace format assembly files for the contigs.
II. Meugenii20090915/
This directory contains the files that have been changed to reflect
the Meug_1.1 assembly. In addition to this README, there are 3 files.
Meugenii20090915-genome.agp (AGP file)
Meugenii20090915-genome.fa (fasta file)
Meugenii20090915-genome.fa.qual (quality file)
The AGP file describes how to combine the individual contigs to create
the linearized genome sequences. The individual contigs were not
changed from the Meug_1.0 assembly and those files can be found in the
Meug20071125-freeze/contigs directory.
The fasta formatted sequence file (Meugenii20090915-genome.fa) and
corresponding quality file (Meugenii20090915-genome.fa.qual) here are
for linearized scaffolds where the gaps between adjacent contigs
within a scaffold are filled with 'N's and the captured gap size is
estimated from the clone insert size. Each scaffold is a separate
sequence within the files.
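As an illustration of how an AGP file combines contigs into linearized scaffolds, here
is a minimal sketch. It assumes the standard tab-separated, nine-column AGP layout
(object, object start, object end, part number, component type, then either component
id, start, end and orientation, or gap length and gap details); whether the AGP file in
this release follows that layout exactly has not been checked here. The function names
and the in-memory contig dictionary are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string, leaving N unchanged."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def build_scaffolds(agp_lines, contigs):
    """Return {scaffold_name: sequence} built from AGP lines.

    agp_lines: iterable of AGP text lines (assumed standard layout).
    contigs:   dict mapping contig name to its sequence.
    """
    parts = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        scaffold, component_type = fields[0], fields[4]
        pieces = parts.setdefault(scaffold, [])
        if component_type in ("N", "U"):
            # Gap line: column 6 holds the gap length; fill with 'N's.
            pieces.append("N" * int(fields[5]))
        else:
            # Sequence line: columns 6-8 give contig id, start, end (1-based).
            start, end = int(fields[6]), int(fields[7])
            seq = contigs[fields[5]][start - 1:end]
            if fields[8] == "-":
                seq = reverse_complement(seq)
            pieces.append(seq)
    return {name: "".join(pieces) for name, pieces in parts.items()}
```

The quality file could be assembled the same way, with a placeholder quality value for
gap positions; that value is not specified in this README.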
III. Meug20071125-freeze/bin0
This directory has one fasta file, its corresponding quality file and
a library insert size table for bin0 reads. Those reads that were in
clusters (created by Atlas_Overlapper) of 3 or fewer reads are
collectively called bin0 reads. These reads were not assembled in this
version. The files are gzipped.
The files are:
Meug20071125-bin0.fa.gz (fasta file)
Meug20071125-bin0.fa.qual.gz (quality file)
Meug_lib_insert_size_20071125.tbl (library insert size information)
The reads are in NCBI trace archive format:
>gnl|ti|1275187850 2070LUS280902O23.g
TCATTCGTCCCAGAGTGCTTCCAATCCTATCACTTTCCAAGACTTCTGGT
CCTTTGTACAGTTGTTCCCTCCACTAACTTCACATGCTTTACTGCCTTTT
The format of the lib_insert_size.tbl:
Library_Name Average_Insert_Size Standard_Deviation
The Library_Name is the first 4 letters of the
template. Average_Insert_Size and Standard_Deviation were calculated
from the mate pairs assembled into the same contig.
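The link between a bin0 read and its row in the insert size table can be sketched as
below. This assumes that the second whitespace-separated token of the fasta header
(2070LUS280902O23.g in the example above) is the template-derived read name, and that
the table is whitespace-separated with three columns; both are assumptions, and the
function names are hypothetical.

```python
def library_of(header):
    """Return the library name for a read header.

    Assumes the second token of the header starts with the template name,
    whose first 4 characters are the library name.
    """
    read_name = header.split()[1]
    return read_name[:4]


def parse_insert_table(lines):
    """Parse 'Library_Name Average_Insert_Size Standard_Deviation' rows.

    Returns {library_name: (average_insert_size, standard_deviation)}.
    Rows that do not have three fields, or whose size fields are not
    numeric (for example a header row), are skipped.
    """
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        try:
            table[fields[0]] = (float(fields[1]), float(fields[2]))
        except ValueError:
            continue
    return table
```

For the example read shown above, library_of returns "2070", which would then be the
key to look up in the parsed table.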
IV. Meug20071125-freeze/repeat-reads/
This directory has one fasta file and its corresponding quality file.
These reads were identified by their overlap number. A read's
overlap number is defined as the number of other reads that overlap
with it. The repetitive reads are identified in two ways:
a. Those reads whose overlap number is >22.
b. Those reads which overlap a read whose overlap number is >90.
The files are:
Meug20071125-repeat_reads.fa.gz (fasta file)
Meug20071125-repeat_reads.fa.qual.gz (quality file)
The format of the sequences is the same as above for the sequences in
the bin0 directory.
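The two repeat-read criteria (a. and b.) listed for this directory can be sketched as
follows. The input structure (a dictionary from each read to the set of reads it
overlaps) and the function name are hypothetical; only the cutoffs 22 and 90 come from
this README, and this is not the code that produced the files.

```python
def find_repeat_reads(overlaps, direct_cutoff=22, neighbor_cutoff=90):
    """Return the set of reads flagged as repetitive.

    overlaps: dict mapping read name -> set of names of overlapping reads.
    """
    # A read's overlap number is the number of other reads overlapping it.
    overlap_number = {read: len(others) for read, others in overlaps.items()}

    # Criterion a: the read's own overlap number exceeds direct_cutoff.
    repeats = {r for r, n in overlap_number.items() if n > direct_cutoff}

    # Criterion b: the read overlaps a read whose overlap number
    # exceeds neighbor_cutoff.
    for read, n in overlap_number.items():
        if n > neighbor_cutoff:
            repeats.update(overlaps[read])
    return repeats
```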
4. Sequence and Scaffold statistics before upgrade and after upgrade
Genome before upgrade - assembly Meug_1.0
Scaffolds/Contigs    Number       N50 (kb)    Bases+Gaps (Mb)    Bases (Mb)
All Scaffolds        616,418      16.05       2,945              2,549
All Contigs          1,211,471    2.5         2,549              2,549
Genome after upgrade - assembly Meug_1.1
Scaffolds/Contigs    Number       N50 (kb)    Bases+Gaps (Mb)    Bases (Mb)
All Scaffolds        277,711      36.60       3,075              2,536
All Contigs          1,174,382    2.6         2,536              2,536
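The N50 values in these tables are not defined in this README. Under the usual
definition (the length L such that sequences of length L or longer contain at least
half of the total bases), N50 can be computed as in the sketch below. Whether the
scaffold N50 reported here counts gap bases is not stated, so the choice of input
lengths is left to the caller; the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until the running total
    reaches half of the summed length; the length of the last sequence
    taken is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")
```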
Completeness
Genome before upgrade - assembly Meug_1.0
Length of cDNA reads aligned (stringency)    100%     95%      80%      50%
Fraction of cDNAs aligned to Scaffolds       25.4%    50.0%    71.3%    84.8%
Genome after upgrade - assembly Meug_1.1
Length of cDNA reads aligned (stringency)    100%     95%      80%      50%
Fraction of cDNAs aligned to Scaffolds       25.5%    50.4%    72.2%    85.8%
5. Read statistics
Read category                     F3             R3             Total
Raw reads                         356,241,796    357,736,722    713,978,518
Uniquely mapped clone reads[1]    52,995,899     52,995,899     105,991,798
Bridge scaffold reads[2]          17,158,908     17,158,908     34,317,816
In_scaffold reads[3]              35,836,991     35,836,991     71,673,982
Mis-oriented reads[4]             10,777         10,777         21,554
>5 kb insert size reads[5]        36,687         36,687         73,374
Sequence Coverage [6] 5.9x
[1] Reads from clones whose F3 end and R3 end both mapped uniquely.
[2] Reads from [1] whose F3 end and R3 end mapped to two different scaffolds.
[3] Reads from [1] whose F3 end and R3 end mapped to the same scaffold.
[4] Reads from [3] whose F3 end and R3 end mapped in the wrong orientation.
[5] Reads from [3] whose insert size inferred from mapping is larger than 5 kb, too large to be realistic.
[6] Sequence coverage was calculated as the total SOLiD read bases divided by the estimated genome size (3000 Mb).
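The coverage figure in footnote [6] and the internal consistency of the read counts can
be checked with the arithmetic below. It assumes every raw read contributes the full
25 bp stated in section 2; all other numbers are copied from the table above.

```python
READ_LENGTH = 25               # bp, from section 2
GENOME_SIZE = 3_000_000_000    # bp, estimated genome size from footnote [6]

raw_f3, raw_r3 = 356_241_796, 357_736_722
total_reads = raw_f3 + raw_r3

# Footnote [6]: total SOLiD read bases divided by estimated genome size.
coverage = total_reads * READ_LENGTH / GENOME_SIZE
print(f"{coverage:.1f}x")   # prints 5.9x

# Per-end counts: categories [2] and [3] partition category [1].
unique, bridge, in_scaffold = 52_995_899, 17_158_908, 35_836_991
assert bridge + in_scaffold == unique

# Shares quoted in section 2 as roughly one third and two thirds.
bridge_share = bridge / unique
in_scaffold_share = in_scaffold / unique
```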
6. History
Meug_1.1 (Sept, 2009)
This release is the second preliminary assembly of the wallaby (Macropus eugenii)
genome. This version is the linear sequence after superscaffolding Meug_1.0 using
SOLiD paired-end reads at a sequence coverage of ~5.9x.
Meug_1.0 (Feb, 2008)
This release is the first preliminary assembly of the wallaby (Macropus eugenii)
genome, with ~2x sequence coverage of Sanger reads.