i5K Project Summary
About the Project
The BCM-HGSC is sequencing a number (~ 40-50) of arthropod genomes as a pilot project to kickstart the i5K.
The i5K is an initiative to sequence the genomes of 5,000 arthropod species. This pilot project builds on our extensive experience sequencing many arthropods over the years, including D. melanogaster, D. pseudoobscura, the honeybee, the red flour beetle, the pea aphid, the Hessian fly, the centipede, and many others.
The i5K was first announced in March 2011 in a letter to Science magazine and in other press releases (for example, from the Entomological Society of America). Its aim is to provide a base reference for understanding the molecular nature of arthropods. It is our hope that this information will be of medical, agricultural, ecological and scientific benefit to the world.
More information about the i5K can be found at the i5K wiki, where you can also sign up for various roles and become involved in the larger project's goal of generating the genomes of 5,000 arthropods.
Because of the relatively large number of species, we will use a table format to publicly track our progress, and we will release raw sequence data, assembled sequence data, transcriptome data and annotation data as soon as possible.
To select species we have worked with the i5K species selection committee, a group of more than 20 entomologists, biologists, systematists and genomics researchers. They have had multiple goals in the selection of species, including medical importance, agricultural importance, filling phylogenetic genomic holes, and attempts to address biological problems.
Sequence generation and Genome Assembly
The large number of species in the i5K pilot requires a low-cost method of sequence generation and a high-quality whole-genome assembly method.
Many people have asked us about the best way of performing such assemblies, and about the data, computer hardware and software needed. New sequencing methods, most notably the Illumina HiSeq technology, have allowed sequence information to be generated at relatively low cost, which makes this pilot project possible.
Material for DNA isolation
For arthropod genome sequencing, we recommend that the input DNA be as non-polymorphic as possible. The ideal is a large haploid individual (for example, a large male hymenopteran), allowing the generation of up to 50 µg of genomic DNA. The next best is an inbred line, with 12-20 generations of sib-sib inbreeding. Theoretically, 12 generations make about 90% of the genome homozygous and 20 generations make close to 99% homozygous; any additional sib-sib inbreeding beyond this does not significantly reduce the remaining heterozygosity in the sample. If inbreeding cannot be performed, the next best is a single large individual, so that the assembler only has to deal with a single diploid sequence. If the individual is so small that multiple individuals are required for sufficient DNA, we recommend that the main library be made from a single individual using a low-input DNA protocol, and that DNA isolated from pooled individuals be used for the libraries of larger insert sizes, which require more DNA for gel cuts.
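The inbreeding figures quoted above can be approximated with the classical recurrence for the inbreeding coefficient F under repeated full-sib mating, F(t) = (1 + 2 F(t-1) + F(t-2)) / 4. The sketch below is only an illustration: it assumes unrelated, non-inbred founders (F = 0 at the start), and the function name is hypothetical rather than part of any i5K tool. Under these assumptions it gives roughly 0.93 after 12 generations and roughly 0.99 after 20.

```python
def sib_mating_inbreeding(generations):
    """Return [F_0, F_1, ..., F_n] for n generations of full-sib mating.

    F_t is the expected fraction of loci that are homozygous by descent.
    Assumption: the founders are unrelated and non-inbred (F_0 = 0).
    """
    f = [0.0]  # F_0 for the founding pair
    for _ in range(generations):
        prev1 = f[-1]                          # F in the previous generation
        prev2 = f[-2] if len(f) >= 2 else 0.0  # F two generations back
        f.append(0.25 * (1.0 + 2.0 * prev1 + prev2))
    return f


if __name__ == "__main__":
    coefficients = sib_mating_inbreeding(20)
    for generation in (12, 20):
        print(f"generation {generation}: F = {coefficients[generation]:.3f}")
```

The residual heterozygosity (1 - F) shrinks by a roughly constant factor each generation, which is why inbreeding beyond about 20 generations gains very little.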
We currently recommend a Qiagen kit, the Qiagen DNeasy Blood and Tissue Kit. Use the Animal Tissue (Spin-Column) extraction protocol, making sure to complete the RNase step; otherwise there will be RNA contamination. This has worked well for DNA isolation from single Nasonia, but we still need to formally collect experiences with other DNA isolation protocols.
Sequence generation for assembly
For this project we are generating fairly high coverage in a number of libraries with different insert sizes. The assembly strategy is based around a seed assembly produced with Allpaths (the Broad Institute's Allpaths assembler), followed by improvement of the seed assembly using our homegrown tools, Atlas-Link and Atlas-GapFill, which can significantly improve the results.
Thus we generate sequence data to enable the Allpaths assembly. As of Nov 2011 this is:
- 40X genome coverage in a 180bp insert library (100bp reads, forward and reverse)
- 40X coverage in 3kb insert data
To enable better scaffolding and local gap filling, we additionally generate 500bp, 1kb, 2kb, and 8kb insert sizes at > 20X coverage.
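As a rough guide to what these coverage targets mean in practice, the number of read pairs needed for a given coverage follows from coverage = (number of pairs x 2 x read length) / genome size. The sketch below is a back-of-the-envelope illustration only: it assumes that coverage means sequence coverage counting both reads of a pair, that reads are 100bp in every library, and it uses a hypothetical 500 Mb genome; the function name is likewise hypothetical.

```python
import math


def read_pairs_for_coverage(genome_size_bp, coverage, read_length_bp=100):
    """Read pairs needed so that total sequenced bases = coverage * genome size.

    Assumption: both reads of each pair have length read_length_bp and
    both count toward coverage.
    """
    bases_needed = coverage * genome_size_bp
    return math.ceil(bases_needed / (2 * read_length_bp))


if __name__ == "__main__":
    genome_size = 500_000_000  # hypothetical 500 Mb arthropod genome
    # Coverage targets from the text; 20 is the lower bound for the "> 20X" libraries.
    targets = {"180bp": 40, "3kb": 40, "500bp": 20, "1kb": 20, "2kb": 20, "8kb": 20}
    for library, coverage in targets.items():
        pairs = read_pairs_for_coverage(genome_size, coverage)
        print(f"{library} insert library: {coverage}X -> {pairs:,} read pairs")
```

Actual yields depend on read filtering and duplicate rates, so real sequencing runs usually need some margin above these figures.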
In addition to genome sequencing, we are also performing a modest amount of RNA-seq to generate data for automated annotation. For each species we will generate RNA-seq data for 3 tissues, usually whole adult males, whole adult females and mixed other life stages. These data will be used together with additional protein homology data for a MAKER automated annotation of the new genomes.
Additional analysis and annotation by the i5K analysis groups
Each of the sequenced species will have a community analysis and publication group, led by the researchers providing the DNA, to enable full analysis of each genome. These groups will additionally have help from the i5K's multiple working groups.
Access to the Data
All data will be downloadable from the species table page as soon as they become available.