- What is the Human Transcript Database?
- How do I use the database?
- How is this database different from Unigene?
- How is the database made?
- Credits
Details of the Human Transcript Database
The cDNA search page was created as an interface to the Human Transcript Database (HTD). This database is a compilation of all of the mRNA sequences that have been submitted to GenBank. The goal of the transcript database is to provide a high-quality dataset representing as many human genes as possible; ESTs are used solely for expression data where appropriate. Toward this end, we have identified all of the mRNA sequences in GenBank and scored them according to the characteristics of the sequencing procedure. The scoring system is described below.
Characteristic Number:
| Number | Meaning |
| 0 | No information registered as to the accuracy of the sequence |
| 1 | Most of the sequence is single stranded coverage |
| 2 | Most of the sequence is double stranded coverage (published sequences are placed in this category by default) |
| 3 | Sequence is of the same quality as genomic (est. error rate = 1 in 10,000) |
| 4 | Sequence has been translated and has produced a protein |
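The characteristic numbers above can be held in a simple lookup, for example when filtering records by a minimum score. The following is only an illustrative sketch: the names `CHARACTERISTIC_LABELS` and `meets_minimum` are hypothetical and are not part of the HTD software, and the sketch assumes that a larger number indicates stronger evidence, which the scale suggests but does not state.

```python
# Hypothetical lookup of the characteristic numbers listed above.
CHARACTERISTIC_LABELS = {
    0: "No information registered as to the accuracy of the sequence",
    1: "Most of the sequence is single stranded coverage",
    2: "Most of the sequence is double stranded coverage "
       "(published sequences are placed here by default)",
    3: "Sequence is of the same quality as genomic (est. error rate = 1 in 10,000)",
    4: "Sequence has been translated and has produced a protein",
}


def meets_minimum(characteristic: int, minimum: int) -> bool:
    """Return True if a record's characteristic number is at least `minimum`.

    Assumption (not stated in the source): larger numbers mean stronger evidence.
    """
    if characteristic not in CHARACTERISTIC_LABELS:
        raise ValueError(f"unknown characteristic number: {characteristic}")
    return characteristic >= minimum
```

With this sketch, `meets_minimum(3, 2)` accepts a genomic-quality sequence when double stranded coverage or better is requested.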
For each sequence we also provide links to several alternative databases used to examine transcript information. Links to the Entrez entry are provided as well as links to the corresponding Unigene clusters, if applicable.
The search page allows one to select specific transcripts based on one or more criteria. In addition to performing BLAST searches, one can also search on any of the following fields:
| Field | Description |
| accession | accession number |
| locus | locus identification tag |
| NID | NID number from GenBank |
| clone | clone identification tag |
| library | library clone was taken from |
| date | date clone was put into GenBank |
| characteristic | characteristic of each sequence |
| chromosome | chromosome which the clone maps to |
| map | map location on the chromosome of the clone |
| description | description of what the clone is |
| keywords | keywords provided by the submitter of the sequence |
| authors | authors of the submission |
| title | title of the submission |
| journal | journal of submission - often used as a miscellaneous text field |
| gene | name of the gene if known |
| product | name of the gene product if known |
| unigene | link to the unigene cluster matched by the sequence |
| parent | indication of whether this is an index sequence (unique) or duplicate |
| organism | organism that clone is from - human is the only one used currently |
| length | length of transcripts |
Differences between Clustering databases and the Human Transcript Database
Several databases exist that cluster all of the information relevant to potential human transcripts into a single location. The sources of these data include complete human transcript information as well as EST data and theoretical gene predictions. Because of the problems associated with EST and gene-prediction data, these databases contain many non-biological transcripts. Furthermore, when consensus sequences are created, they are electronic, and no physical clone exists that can be obtained by an interested investigator. To eliminate the noise and provide a clone-based database, we have assembled the Human Transcript Database. This database consists solely of sequences that are accurate and derived from a single clone, which may be obtained by any researcher.
How to use the Human Transcript Search Engine
The Human Transcript Database (HTD) can be accessed through the search engine. This web interface allows one to perform a number of searches of the database and will return a table displaying all of the matching sequences that are present in the database.
There are two basic search mechanisms - content based and sequence based.
To search the database for keywords, for instance a particular author, use the search form. This searches the database for any keyword or characteristic of the entry. To search for sequences based upon their length, enter the minimum size or the size range desired and press the search key. A table will be displayed that shows all of the matching sequences. The length search may be performed in combination with the keyword search.
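The combination of a keyword search with a length search can be pictured as two filters applied to the same set of records. The sketch below is a minimal illustration, not the search engine's actual code: the `Record` class, the `search` function, the sample field selection, and the case-insensitive substring matching are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Record:
    """Hypothetical, reduced view of one database entry."""
    accession: str
    authors: str
    description: str
    length: int


def search(records: Iterable[Record],
           keyword: Optional[str] = None,
           min_length: Optional[int] = None,
           max_length: Optional[int] = None) -> List[Record]:
    """Return records matching the keyword (if given) and the length bounds (if given)."""
    hits = []
    for rec in records:
        if keyword is not None:
            # Assumed matching rule: case-insensitive substring over the text fields.
            haystack = f"{rec.accession} {rec.authors} {rec.description}".lower()
            if keyword.lower() not in haystack:
                continue
        if min_length is not None and rec.length < min_length:
            continue
        if max_length is not None and rec.length > max_length:
            continue
        hits.append(rec)
    return hits
```

Supplying only `min_length` corresponds to entering a minimum size; supplying both bounds corresponds to entering a size range.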
The database can also be used to compare a given sequence to all sequences in the HTD. This can be accomplished either by uploading a sequence to the web page or by pasting a sequence into the given space. Note that a pasted sequence should contain no header line, whereas an uploaded sequence should be in FASTA format (the first line is a description). The input sequence is compared to the database by a BLAST program (Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 'Gapped BLAST and PSI-BLAST: a new generation of protein database search programs', Nucleic Acids Res. 25:3389-3402). The output is displayed as a graph with an associated table.
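Because the pasted form expects a bare sequence while the upload expects FASTA, it can help to separate the description line from the sequence before submitting. The helper below is a small sketch under that assumption; `split_fasta` is a hypothetical name and is not part of the web interface. It relies only on the FASTA convention that the description line begins with `>`.

```python
from typing import Optional, Tuple


def split_fasta(text: str) -> Tuple[Optional[str], str]:
    """Split single-record FASTA text into (description, sequence).

    If there is no leading '>' line, the description is None and the whole
    input is treated as sequence. Line breaks are removed from the sequence
    and letters are upper-cased.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    sequence = "".join(lines).upper()
    return header, sequence
```

The second element of the returned pair is suitable for pasting; the original text, with its description line, is what an upload would contain.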
For more sensitive searches, the cutoff value for BLAST searches can be set (bottom of form). Please note that the BLAST program is very good at returning nearly exact matches, but inexact matches are both less reliable and more computationally intensive.
The table that is displayed shows the data contained in a number of fields and links to other databases. The first column contains a link to the GenBank entry that is maintained at the National Center for Biotechnology Information. This represents the original source of data from which the Human Transcript Database was built. The second column contains a link to the Unigene clustering database. This database clusters together all similar sequences, including EST sequences and predicted sequences. The third column contains the accession number for the entry; this is a unique identifier that is often used for referencing this sequence. The fourth column contains the GID number, which is also a unique number for each sequence but may change over time. The fifth column contains a brief description of the entry. The sixth column gives the length of the clone in the database. The seventh column contains a link to the databank entry, which will display all of the information we maintain on the sequence. For sequence comparison searches, the score of the match is displayed in an eighth column.
Human Transcript Database assembly methods
The transcript database was designed to identify and incorporate all of the cDNAs that have been sequenced and deposited into GenBank. Here we describe the parameters that were used to identify the sequences as potential mRNA sequences and the culling of non-mRNA sequences. At every step, effort was taken to incorporate as many sequences as possible, which results in the presence of a small number of sequences not derived from mRNA. We have tried to identify these sequences and rid the database of them, but nevertheless a small amount of contamination remains.
Description of Method:
The ASN.1 files from the NCBI were downloaded and examined to find the sequences that claimed to be mRNA sequences or were of unknown type. The accession numbers from these sequences were then used to parse through the corresponding GenBank file format and extract all of the information present in the database. At this point, the sequences were tested for the presence of one of a number of keywords (e.g. BAC). Entries containing one of these keywords were removed from the database. Following this culling step, the final list of accession numbers was used to parse through the FASTA file of all of GenBank to create a concatenated FASTA file of all of the transcripts to be used for BLAST searches. Lastly, the database was examined for transcripts that had been cut up into individual exons, and the joined transcript was produced.
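The keyword culling and FASTA extraction steps described above can be sketched as follows. This is an illustration only, not the scripts actually used to build the database: the function names are hypothetical, whole-word case-insensitive matching is an assumption, "BAC" is the only exclusion keyword named here, and the sketch assumes that the accession number is the first whitespace-delimited token on each FASTA description line, which may not match the real GenBank header layout.

```python
import re
from typing import Dict, Iterable, Set


def cull_by_keywords(entries: Dict[str, str], keywords: Iterable[str]) -> Set[str]:
    """Return accessions whose entry text contains none of the keywords.

    `entries` maps accession -> entry text. Matching is whole-word and
    case-insensitive (an assumption made for this sketch).
    """
    patterns = [re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE) for k in keywords]
    return {
        acc for acc, text in entries.items()
        if not any(p.search(text) for p in patterns)
    }


def filter_fasta(fasta_text: str, keep: Set[str]) -> str:
    """Concatenate only the FASTA records whose accession is in `keep`.

    Assumes the accession is the first token after '>' on the description line.
    """
    out = []
    writing = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            writing = bool(tokens) and tokens[0] in keep
        if writing:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```

The output of `filter_fasta` plays the role of the concatenated FASTA file that is then used for BLAST searches.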
Caveats:
The approach we have taken focuses on the incorporation of as many transcripts as possible. Reasonable effort has been made to exclude those sequences that are not derived from mRNAs; however, such sequences exist within the database. Chimeric and genomic cDNA clones have not been rigorously excluded from the database.
Disclaimer:
The database assembled here is intended to assist in the acquisition of scientific knowledge. No claims as to the completeness or accuracy of the database are made.
Credits
Concept and Database Design: John Bouck and Kim Worley
Database Maintenance: John Bouck and Michael McLeod
Web and Interface Design: Michael McLeod and John Bouck
Thanks to: Harley Gorrell, Andrew Arenson, Michael McLeod, James Durbin, Pamela Culpepper, Wei Yu, and Kim Worley for their help programming, debugging and helpful discussions.