This is draft release 1 for genome-wide SNP genotyping
and targeted sequencing in DNA samples from a variety of human
populations (sometimes referred to as the "HapMap 3" samples).
This release contains the following data:
- SNP genotype data generated from 1115 samples, collected
using two platforms: the Illumina Human1M (by the Wellcome Trust
Sanger Institute) and the Affymetrix SNP 6.0 (by the Broad
Institute). Data from the two platforms have been merged for this
release.
- PCR-based resequencing data (by Baylor College of Medicine
Human Genome Sequencing Center) across ten 100-kb regions
(collectively referred to as "ENCODE 3") in 712 samples.
Since this is a draft release, we ask you to check this site
regularly for updates and new releases.
The HapMap 3 sample collection comprises 1,301 samples
(including the original 270 samples used in Phase I and II of the International HapMap Project) from
11 populations, listed below alphabetically by their 3-letter
labels. For more information about these samples, click here.
Five of the ten ENCODE 3 regions overlap with the HapMap-ENCODE regions; the other five were selected at random from the
ENCODE target regions (excluding the 10 HapMap-ENCODE regions). Each ENCODE 3 region is 100 kb in size and is centered within its
respective ENCODE region. Read more about the ENCODE project here.
region | chromosome | coordinates (NCBI build 36) | status
ENm010 | 7 | 27,124,046-27,224,045 | HapMap-ENCODE
ENr321 | 8 | 119,082,221-119,182,220 | HapMap-ENCODE
ENr232 | 9 | 130,925,123-131,025,122 | HapMap-ENCODE
ENr123 | 12 | 38,826,477-38,926,476 | HapMap-ENCODE
ENr213 | 18 | 23,919,232-24,019,231 | HapMap-ENCODE
ENr331 | 2 | 220,185,590-220,285,589 | New
ENr221 | 2 | 56,071,007-56,171,006 | New
ENr233 | 15 | 41,720,089-41,820,088 | New
ENr313 | 16 | 61,033,950-61,133,949 | New
ENr133 | 21 | 39,444,467-39,544,466 | New
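As a quick sanity check on the region table, the following Python sketch recomputes each region's length from its listed coordinates. It assumes the coordinates are 1-based and inclusive at both ends, which the page does not state explicitly; the dictionary and function names are illustrative only.

```python
# Coordinates copied from the ENCODE 3 region table (NCBI build 36).
# Assumption: intervals are 1-based and inclusive at both ends.
ENCODE3_REGIONS = {
    "ENm010": ("7", "27,124,046-27,224,045"),
    "ENr321": ("8", "119,082,221-119,182,220"),
    "ENr232": ("9", "130,925,123-131,025,122"),
    "ENr123": ("12", "38,826,477-38,926,476"),
    "ENr213": ("18", "23,919,232-24,019,231"),
    "ENr331": ("2", "220,185,590-220,285,589"),
    "ENr221": ("2", "56,071,007-56,171,006"),
    "ENr233": ("15", "41,720,089-41,820,088"),
    "ENr313": ("16", "61,033,950-61,133,949"),
    "ENr133": ("21", "39,444,467-39,544,466"),
}


def region_length(coords: str) -> int:
    """Return the length in bases of a 'start-end' interval with thousands separators."""
    start, end = (int(part.replace(",", "")) for part in coords.split("-"))
    return end - start + 1


lengths = {name: region_length(coords) for name, (_chrom, coords) in ENCODE3_REGIONS.items()}
```

Under the inclusive-coordinate assumption, every entry in `lengths` comes out at 100,000 bases, matching the stated 100-kb region size.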
A. SNP GENOTYPE DATA
label | number of samples | number of QC+ SNPs | number of polymorphic QC+ SNPs
ASW | 71 | 1632186 | 1536247
CEU | 162 | 1634020 | 1403896
CHB | 82 | 1637672 | 1311113
CHD | 70 | 1619203 | 1270600
GIH | 83 | 1631060 | 1391578
JPT | 82 | 1637610 | 1272736
LWK | 83 | 1631688 | 1507520
MEX | 71 | 1614892 | 1430334
MKK | 171 | 1621427 | 1525239
TSI | 77 | 1629957 | 1393925
YRI | 163 | 1634666 | 1484416
consensus | 1115 | 1525445 | 1490422
B. PCR RESEQUENCING DATA
label | number of samples
ASW | 55
CEU | 119
CHB | 90
CHD | 30
GIH | 60
JPT | 91
LWK | 60
MEX | 27
MKK | 0
TSI | 60
YRI | 120
total | 712
A. SNP GENOTYPE DATA
Genotyping concordance between the two platforms was 0.9931 (computed over 249889 overlapping SNPs). Data from the two platforms were merged
using PLINK (--merge-mode 1), which keeps a genotype call only where the non-missing calls from the two platforms agree
(that is, the merged genotype is set to missing if the two platforms give different, non-missing calls).
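The merge rule described above can be sketched in a few lines of Python. This is only an illustration of the consensus logic, not PLINK's implementation. It assumes genotypes are represented as two-letter strings such as "AG", with None for a missing call, and it assumes concordance is the fraction of identical calls among pairs where both platforms gave a non-missing call (the page does not spell out the exact definition). All function names are hypothetical.

```python
from typing import Optional, Sequence


def normalize(call: Optional[str]) -> Optional[str]:
    """Sort the two alleles so that 'GA' and 'AG' compare equal; None stays missing."""
    return None if call is None else "".join(sorted(call))


def merge_call(call_a: Optional[str], call_b: Optional[str]) -> Optional[str]:
    """Consensus merge of one genotype from two platforms.

    A missing call is ignored in favour of the other platform's call;
    two different non-missing calls give a missing merged genotype.
    """
    a, b = normalize(call_a), normalize(call_b)
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else None


def concordance(calls_a: Sequence[Optional[str]], calls_b: Sequence[Optional[str]]) -> float:
    """Fraction of matching calls among pairs where both calls are non-missing."""
    compared = 0
    matched = 0
    for raw_a, raw_b in zip(calls_a, calls_b):
        a, b = normalize(raw_a), normalize(raw_b)
        if a is None or b is None:
            continue  # pairs with a missing call are not counted (assumption)
        compared += 1
        if a == b:
            matched += 1
    return matched / compared if compared else float("nan")
```

For example, `merge_call("AA", "AG")` returns None because the two platforms disagree, while `merge_call(None, "CC")` keeps the single available call.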
Quality control at the individual level was performed separately by the two sites. Only individuals with genotype data on both platforms were
kept in this release. The following criteria were used to keep SNPs in the QC+ data sets:
- Hardy-Weinberg p>0.000001 (per population)
- missingness
- 3 Mendel errors (per population; only applies to YRI, CEU, ASW, MEX, MKK)
- SNP must have an rsID and map to a unique genomic location
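To make the Hardy-Weinberg criterion concrete, here is a minimal Python sketch of a per-population check against the p>0.000001 threshold. The page does not say which test was used; this sketch assumes a simple 1-degree-of-freedom chi-square goodness-of-fit test for a biallelic SNP, which may differ from the exact test that tools such as PLINK apply. The function names are hypothetical.

```python
import math

HWE_P_THRESHOLD = 1e-6  # threshold quoted in the QC criteria


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium at a biallelic SNP.

    Assumption: a chi-square approximation stands in for whatever test
    was actually used to produce the release.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no deviation can be measured
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa: int, n_ab: int, n_bb: int) -> bool:
    """True if the SNP would be kept under the Hardy-Weinberg criterion."""
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) > HWE_P_THRESHOLD
```

Genotype counts of 25/50/25 fit the expected proportions and pass, whereas 50/0/50 (no heterozygotes at all) falls far below the threshold and would be removed.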
The "consensus" data set contains data for 1115 individuals (558 males, 557 females; 924 founders and 191 non-founders), only keeping SNPs that passed QC in all
populations (overall call rate is 0.998). The "consensus|polymorphic" data set has 35023 monomorphic SNPs (across the entire data set) removed.
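The construction of the two consensus data sets can be sketched as a set intersection followed by a polymorphism filter. The Python below is an illustration under stated assumptions: per-population QC+ SNP lists are available as sets of rsIDs, and allele counts across the entire data set are available per SNP. The function names and input layout are hypothetical.

```python
from typing import Dict, Set


def consensus_snps(qc_pass_by_population: Dict[str, Set[str]]) -> Set[str]:
    """SNPs that passed QC in every population (the 'consensus' set)."""
    sets = list(qc_pass_by_population.values())
    return set.intersection(*sets) if sets else set()


def polymorphic_snps(snps: Set[str], allele_counts: Dict[str, Dict[str, int]]) -> Set[str]:
    """Keep SNPs with at least two alleles observed across the entire data set."""
    kept = set()
    for snp in snps:
        observed = [a for a, count in allele_counts.get(snp, {}).items() if count > 0]
        if len(observed) >= 2:
            kept.add(snp)
    return kept


# Arithmetic check against the figures quoted on this page:
# consensus QC+ SNPs minus monomorphic SNPs gives the polymorphic count.
CONSENSUS_QC_SNPS = 1525445
MONOMORPHIC_SNPS = 35023
POLYMORPHIC_QC_SNPS = CONSENSUS_QC_SNPS - MONOMORPHIC_SNPS  # 1490422
```

The final subtraction reproduces the 1490422 polymorphic QC+ SNPs listed for the consensus row of the summary table.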
In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.
B. PCR RESEQUENCING DATA
The sequence-based variant calls were generated by tiling PCR primer sets spaced approximately 800 bases apart across the ENCODE 3 regions. After
low-quality reads were filtered out, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters
were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out
SNPs with low completeness across samples, or with too many conflicting genotype calls between the two strands.
In the QC+ data set, we filtered out samples with low completeness, and filtered out SNPs with low call rate in each population (
A. SNP GENOTYPE DATA
Missing from this release are Illumina SNPs that are A/T or C/G due to strandedness issues.
Missing from this release are Illumina SNPs that are mitochondrial (as they do not have rsIDs).
There may be a few remaining Illumina SNPs in this release that are still on the (-/rev) strand of NCBI build 36, but they are not A/T or C/G SNPs, so they are easy to identify downstream.
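The reason A/T and C/G SNPs are singled out is that their alleles look identical on both strands, so a strand error cannot be detected from the alleles alone. The Python sketch below illustrates this. It assumes a trusted set of forward-strand alleles is available for comparison (for example from another platform or an annotation source); the function names are hypothetical.

```python
from typing import Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return COMPLEMENT[allele1] == allele2


def flip(alleles: Tuple[str, str]) -> Tuple[str, str]:
    """Complement both alleles, i.e. report them on the opposite strand."""
    return (COMPLEMENT[alleles[0]], COMPLEMENT[alleles[1]])


def needs_flip(observed: Tuple[str, str], forward: Tuple[str, str]) -> bool:
    """True if the observed alleles match the forward-strand alleles only after flipping.

    For A/T and C/G SNPs the observed and flipped allele sets are identical,
    so this check always returns False and cannot reveal a strand error.
    """
    if set(observed) == set(forward):
        return False
    if set(flip(observed)) == set(forward):
        return True
    raise ValueError("alleles do not match the reference on either strand")
```

For instance, `needs_flip(("T", "C"), ("A", "G"))` returns True, flagging a SNP reported on the reverse strand, whereas an A/T SNP would pass unnoticed whichever strand it was reported on.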
B. PCR RESEQUENCING DATA
Not all variant calls have been validated yet: we estimate that the current false positive rate is ~12% among all calls, with a slightly higher rate (~14%) among
singletons alone. Additional validation is ongoing. PCR sequencing of additional samples (MKK) is also ongoing.
A. SNP GENOTYPE DATA
To download the HapMap 3 data from our ftp site, click here.
B. PCR RESEQUENCING DATA
To download the ENCODE 3 data from our ftp site, click here.
Listed below are the analysis plans that we are currently pursuing:
SNP allele frequency estimation
Population differentiation
Linkage disequilibrium analysis
SNP tagging
Imputation efficiency
Genomic locations of human CNVs
Genotypes for CNVs
Population genetic properties of CNVs (allele frequencies, population differentiation, etc.)
Mutation rate (frequency of de novo CNV) and potential mutational mechanisms
Linkage disequilibrium properties of CNVs
Tagging and imputation of CNVs
Signals of selection around CNVs
Association of SNPs and CNVs with expression phenotypes
The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. An NHGRI policy
statement based on the outcome of the meeting is on the NHGRI web site (http://www.genome.gov/10506537).
The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the
aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for
their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate
in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish
the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.