Atlas is a collection of software tools to facilitate the assembly of large genomes from whole genome shotgun reads, or a combination of whole genome shotgun reads and BAC or other localized reads.
This suite of tools has been used in the assemblies of the rat (Rattus norvegicus), fruit fly (D. pseudoobscura), and honey bee (Apis mellifera), and will be used for the assemblies of sea urchin (Strongylocentrotus purpuratus), bovine (Bos taurus), and other species.
This is a new release of the Atlas software. Some components used to produce the rat assembly are not yet packaged for distribution; they will be added shortly. While we anticipate no problems running this software on different platforms, it has not yet been extensively tested off-site. The older 2004 version of the Atlas distribution remains available for the convenience of those who licensed that version. If you did not download it before, please do not download it now.
Binaries are available for Linux and Sun Solaris Unix; a Mac OS X version will be available soon. They are packed in separate gzipped tar files, atlas-linux.tgz and atlas-solaris.tgz.
Subroutines and variables for atlas-prep-reads, atlas-asm-wgs, and atlas-build-scaffold-file.
After downloading the file atlas-linux.tgz, atlas-solaris.tgz, or atlas-osx.tgz (available soon) for your operating system, move it to the directory where you want to install the Atlas assembly package, then unpack it by typing the command:
tar xvzf atlas2005-linux.tgz, or
tar xvzf atlas2005-solaris.tgz.
You will get the following subdirectories and files:
documents/
documents/readme.html
documents/graphics/atlaslogo.gif
bin/
bin/atlas-overlapper
bin/atlas-splitbadcontigs
bin/atlas-screen-window
bin/atlas-binner
bin/atlas-trimphraptails
bin/atlas-linearsequence
bin/atlas-count-kmers
bin/atlas-extractbins
local/
perl/
perl/bin/
perl/bin/atlas-asm-wgs
perl/bin/atlas-build-scaffold-file
perl/bin/atlas-createindex
perl/bin/atlas-divide-fafile
perl/bin/atlas-prep-reads
perl/bin/atlas-screen-trim-file
perl/bin/atlas-separate-bin-assemble
perl/lib/
perl/lib/Atlas/
perl/lib/Atlas/AsmWgs.pm
perl/lib/Atlas/PrepReads.pm
perl/lib/Atlas/Scaffold.pm
perl/lib/Atlas/ScaffoldHeapEle.pm
perl/lib/Atlas/Utility/
perl/lib/Atlas/Utility/ObjectAttribute.pm
perl/lib/Atlas/Project/
perl/lib/Atlas/Project/Trace.pm
perl/lib/Atlas/Project/Contig.pm
data/
data/univec/
data/univec/README.uv
data/univec/univec.fa
data/demo/
data/demo/demo.001.fa
data/demo/demo.001.fa.qual
In the bash shell, use the command
export ATLAS_ROOT=xxxx
where xxxx is the directory in which you installed the Atlas package. You may want to put this line in the .bashrc file in your home directory.
In the C shell, use the command
setenv ATLAS_ROOT xxxx
and put it in the equivalent file, .cshrc.
You can install the required Heap Perl modules either in your system's default Perl library directory or in the ATLAS_ROOT/perl/lib directory.
In the latter case, they are:
perl/lib/Heap.pm, and
perl/lib/Heap/Fibonacci.pm.
cd local
ln -s <full_path_of_phrap> phrap
where <full_path_of_phrap> is the location where phrap is installed on your computer. For example, at the HGSC, <full_path_of_phrap> is /home/hgsc/bin/phrap
Do the same for cross_match.
If you do not have phrap and/or cross_match, please see the webpage http://www.phrap.org for information about how to get a copy.
First you need to decide where to put the read data; this directory is referred to as <reads_dir> (any name). Under this directory, create a subdirectory original/ (no other name is allowed). Then copy the demo data:
cp $ATLAS_ROOT/data/demo/* <reads_dir>/original
Change to the perl/bin directory, and then run the command
atlas-prep-reads -r <reads_dir>
where <reads_dir> is the directory in which you will put your original and trimmed data.
The results:
The data files we supplied are copied to the <reads_dir>/original directory, where they are indexed. The trimmed read files are placed in <reads_dir>/screened. The 32-mer analysis is performed under <reads_dir>/screened/kmers, and the resulting kill-file is moved to <reads_dir>/screened.
Choose a directory for your assembly, <asm_dir> (any name), and run
atlas-asm-wgs -r <reads_dir> -d <asm_dir>
You'll get the following results in <asm_dir>:
For the ace file format, see http://www.phrap.org.
In the graph file, output is written in the form
SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} QueryName2...
For the format of scaffold files, see the notes for atlas-linearsequence.
For details, please refer to the paper by Havlak et al.:
Paul Havlak, Rui Chen, K. James Durbin, Amy Egan, Yanru Ren, Xing-Zhi Song, George M. Weinstock and Richard A. Gibbs. The Atlas Genome Assembly System. Genome Res. 14:721-732.
IMPORTANT LIMITATIONS:
Query read sequences cannot be longer than 4095 bp (2^12 - 1). There cannot be more than 524,000 source reads (2^19).
UNPREDICTABLE behavior will result if these limits are exceeded.
Uses a filtration technique to compare all of the sequences in a FASTA file to each other, identifying all of the overlaps between the sequences. Small samples are taken from each read and compared with the full sequence of all other reads using a fast exact match algorithm (hash table in current version; suffix trees have been used in other versions). Repeats are filtered out, de novo, in order to produce physical overlaps only. A banded alignment of all pairs that contain a common low-copy n-mer is performed to determine true overlap amounts to be reported in a graph file. The output will be in the form:
SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} QueryName2...
A single query file can be specified with the -q option (for backward compatibility), but multiple query files can also be added onto the line, with no arbitrary limit.
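The edge syntax above can be unpacked mechanically. As an illustration (not part of the distribution), a small parser for one adjacency-list line might look like this; the read names and values in the sample line are made up:

```python
# Minimal parser for the overlap-graph line format:
#   SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} ...
def parse_graph_line(line):
    """Parse one adjacency-list line into (source, list of edge dicts)."""
    fields = line.split()
    source, edges = fields[0], []
    for tok in fields[1:]:
        name, _, rest = tok.partition("{")
        span, score, left, right, seeds = map(int, rest.rstrip("}").split(","))
        edges.append({"query": name, "span": span, "score": score,
                      "left_ext": left, "right_ext": right, "seed_count": seeds})
    return source, edges

# Hypothetical example line:
src, edges = parse_graph_line("readA readB{410,398,-35,120,3} readC{220,210,0,-15,1}")
```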
-s {SourceFile} | Name of source FASTA file. Hash will be built from source. |
-q {QueryFile} | Name of query FASTA file. Query reads will be sampled and samples compared to source hash. |
-Q {QueryFOF} | A file which lists the query files to use. Can be used in conjunction with -q or with a list of queries as free arguments. (The purpose of this option is to allow very long query lists). |
-o {OutFile} | Name of output file. This will be an adjacency list graph of the overlaps for each sequence in the source file. |
-v | Print version and version history. |
------ Misc -----------------------------
-b {Begin with} | Number of Sequence in source to start with. (0 based). |
-e {End with} | Number of Sequence in source to end with. The b and e options are for building the hash from a large FASTA file (e.g. > 50,000 reads), where the file has to be compared in chunks. Since it scans through the source until it reaches read b, this is not as efficient as creating several separate source files and running the program several separate times, but it is often more convenient. |
-Z | gzip the output graph. Default is not-gzipped. Output graphs can get quite large, especially now that multiple values are being reported for each edge. This option will save the graph output directly to a gzip file. NOTE: Does NOT automatically tack on .gz to file name. |
-F | Output format: Flip sense of overlap extensions so they are relative to the read that is origin of the edge (default relative to sink of edge). |
-f | Output format: Flip direction of edges to go from source read to query read (default is from query to source). |
----- Heuristics ------------------------------
-P | Prime-numbered kill hash size. A good value will be at least 1.2 times the number of kill mers you have. An integer value selects one of a set of predefined primes (5 is the default). |
-p | Directly specify a prime-numbered kill hash size (overrides -P). |
-O {oligo size} | Size of k-mer to use for seeding banded alignments (4-32). Allows the sample length to be specified. Default 32bp. |
-B {band size} | Band size for banded alignments (Default=2, which is +/- 2 from the diagonal). Valid range: 1-25. |
-R {cutoff mers} | The repeat cutoff specified as a number of mers, rather than as a number of standard deviations above the mean. Overrides the -r option. Default is to use -r. |
-r {RepeatThresh} | Cutoff for repeats. Samples occurring more often than mean+r*sigma are ignored as repeat samples. This cutoff attempts to be a function of the actual coverage, since the mean number of occurrences of a random sample will be close to the mean coverage, skewed slightly high by the presence of repeats. For low-coverage data (C < 2), r may need to be 2 or 3 since sigma is likely to be small. For high-coverage data, r closer to 0 is probably appropriate. I plan to make this more sophisticated when I get a chance so that the user doesn't have to think about it. Default = 1. |
-K {RepeatFileName} | When specified, a file of known repeat mers is read in and used to filter repeats. The file lists a mer, a tab, and the mer count on each line. K for kill list. |
-k {MinRepeatCount} | When reading in the -K repeat file, save in RAM only those mers with count >= MinRepeatCount. |
-Y {MaxOverlapSeed} | Max seed. Largest mer count that will be used to seed an overlap. This is analogous to the -R option, except that it is based on the mer frequencies reported in the RepeatFile (-K), which are globally derived. This option is applied before the -R option, so the -R statistics are computed on the mers that remain. |
-H {HashMerLimit} | Another form of repeat cutoff. The purpose of this limit is to constrain the size of the hash: HashMerLimit is a hard limit on the number of locations that will be saved for any mer. It should be set slightly larger than what you expect mean+r*sigma (above) to be; 100 or 1000 are pretty safe values in most cases. The smaller this number, the more reads can be put into each hash, and the quicker a set of jobs will go. Default value is 1000. |
-m {MaxMismatch%} | Maximum mismatch percentage to accept in the aligned region. Edges that reflect more than this fraction of mismatches will not be reported. This is translated into a score cutoff as follows. Since each match is +1 and each mismatch is -1 (or -2 for an indel, which we ignore here), a mismatch carries a double penalty so far as the score is concerned (no +1 for the match, plus a -1 penalty). So, for a particular span, the score corresponding to a fraction x of mismatches will be score = (1-(x/100))*span - (x/100)*span = (1-(2x/100))*span. The cutoff is therefore cutoff = (1-(2x/100))*span; if score < cutoff, the edge is omitted from the graph. Default value is 2 (i.e., 2%). |
-M {Min Read Size} | Minimum size of reads to consider. Default is 100bp. |
-S {SamplesPerRead} | Samples per read. The code tries to take this many samples, but a read may be too short, in which case it takes as many as it can. |
-I {SampleInterior} | Sample across the entire read. Default is end sampling only, where SamplesPerRead/2 samples are taken from each end of the read. End sampling typically gives more sensitivity per sample since we are looking for end-to-end matches, but situations arise where it is necessary to look at the interior of reads as well. |
-G {SampleGap} | Step between samples. Default is the same as the sample size (-O), for perfectly tiled samples (barring Ns). |
--- Experimental options -----
-x {BucketSize} | Set the initial hash bucket size. |
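The -m score cutoff described above follows directly from the +1/-1 scoring. A worked example (illustrative, not part of the distribution):

```python
# Worked example of the -m cutoff formula: with match = +1 and mismatch = -1,
# a fraction x/100 of mismatches over a span scores
# (1 - x/100)*span - (x/100)*span = (1 - 2x/100)*span.

def score_cutoff(span, max_mismatch_pct):
    """Minimum acceptable alignment score for a given span and -m value."""
    return (1 - 2 * max_mismatch_pct / 100) * span

# With the default -m 2 over a 500 bp overlap span:
cutoff = score_cutoff(500, 2)   # 0.96 * 500 = 480.0
```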
Written by: James Durbin (kdurbin@bcm.tmc.edu)
Version 1.70
Reads in an asymmetric directed graph, with each possible source node on a separate line, followed by its whitespace-delimited out-edges. Each out-edge is represented as a pair, node name followed by a weight, where the weight is between 1 and 99. Output is one file of read names (one read per line) per bin. Core reads of a bin occur in only one bin, but peripheral reads may occur in multiple bins. Bins are named prefixNNNNsuffix, where prefix0000suffix is the default bin for reads not placed elsewhere. [Defaults are given in square brackets.]
Option values are:
-f {GraphFile} | [] Name of graph file (empty => cin). |
-p {BinPrefix} | [bin] Prefix for bin file names. |
-s {BinSuffix} | [.fon] Suffix for bin file names. |
+A | [-] Boolean, put all sinks of core edges in one bin if true. |
-i {Ignore} | [35] Integer, minimum below which edge weight is ignored. |
-c {Coverage} | [6] Floating-point, estimated coverage of input graph file. This implies -l (3 * coverage) and -r (2 * coverage) unless alternative values for those options are given. |
-l {FrontierLmt} | [-1] Integer, maximum size of the frontier in BFS. |
-L {FrontierLmt} | [-1] Integer. If this is different from -l, all values between -l and -L will be used. |
-r {RepeatEdge} | [-1] Integer, max edges before node treated as repeat. This is used in BAC-fishing mode only. |
-k {KillRepeat} | [12] Integer, kill overlap edges seeded by nmer with more than this many copies. |
+S | [-] Boolean, consider extensions relative to the sink (otherwise relative to origin). |
+E | [-] Boolean, count left and right neighbors separately against the RepeatEdge count. |
-n {EstNodes} | [262144] Integer, estimate max number nodes for efficiency's sake. |
-m {MinBinSize} | [10] Integer, minimum number of nodes per output bin. |
-x {MaxBinSize} | [99999] Integer, maximum number of nodes per output bin. |
-e {MaxBinDist} | [10000] Integer, maximum distance from start of bin in edges. |
+w | [+] Boolean, use estimated-exact-span weighting. |
-d {DebugString} | [] Each character indicates a debugging option to turn on; '+' turns on all debugging. |
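To make the input format and the -i threshold concrete, here is a toy sketch (not the actual implementation) of growing one bin by breadth-first search over the weighted adjacency list, ignoring weak edges; the frontier limit (-l), repeat-edge cutoff (-r), and the other bin size bounds are omitted, and the example graph is made up:

```python
from collections import deque

# Toy sketch of the binning idea: BFS over a weighted adjacency list,
# skipping edges whose weight is below the -i threshold.
def grow_bin(graph, seed, ignore_below=35, max_bin=99999):
    """Return the set of reads reachable from seed via strong edges."""
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < max_bin:
        node = queue.popleft()
        for neighbor, weight in graph.get(node, []):
            if weight >= ignore_below and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Hypothetical graph: readA overlaps readB strongly and readC only weakly.
g = {"readA": [("readB", 80), ("readC", 10)], "readB": [("readA", 80)]}
```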
atlas-count-batch K HASHSIZE SLICING SLICE MINREPORT
Arguments are:
K | kmer size, in our demonstration K=32 |
HASHSIZE | A large prime for the hash table size: (genome size * 1.2)/SLICING < HASHSIZE < (available RAM in bytes)/10 |
SLICING | Hash slicing factor: 1/SLICING of the k-mers are tabulated per job, so SLICING jobs are needed to produce the complete k-mer table. |
SLICE | Which job this is of the several jobs (0, ..., SLICING-1) needed to produce the complete table. |
MINREPORT | Print k-mers with frequencies >= MINREPORT; print only the distribution if MINREPORT = 0. |
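The SLICING/SLICE scheme partitions k-mers by hash value, so each job tabulates a disjoint slice and the union of all jobs covers every k-mer exactly once. A toy illustration (Python's built-in hash stands in for the program's internal hash function, and the k-mer list is made up):

```python
SLICING = 4  # hash slicing factor; each of the 4 jobs handles one slice

def slice_of(kmer, slicing=SLICING):
    # Stand-in for the program's internal k-mer hash function.
    return hash(kmer) % slicing

kmers = ["ACGTACGT", "TTTTAAAA", "GGGGCCCC", "ACGTACGT"]
# Job s tabulates only the k-mers that fall in its slice:
per_job = {s: [k for k in kmers if slice_of(k) == s] for s in range(SLICING)}
```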
Reads in two FASTA sequence files (masked and unmasked) and a quality file for a set of reads to be trimmed. Also requires a file of vector sequences to be aggressively trimmed (any 10bp match) at the beginnings of reads. Looks for the first and last passing windows and masks any leading or trailing junk. Use with '-s v' to screen for Phrap without quality trimming.
Boolean options are specified with + to turn on, - to turn off (e.g., -x)
Default Option values are:
-O {string} | [] suffix to append to arg for opening original (unscreened) sequence file |
-S {string} | [] suffix to append to arg for opening screened sequence file |
-Q {string} | [.qual] suffix to append to arg for opening quality file |
-p {phredscore} | [20] minimum phred quality score for good bases |
-l {length} | [100] minimum number of good bases in insert for passed read |
-w {length} | [50] trim read from first passing window of this size to last |
-P {length} | [50] look extra hard for vector in this many bases at front of read |
-e {error_rate} | [1.25] maximum expected error rate for a passing window |
-k or +k | [+] trim to window bases with qual >= 20 |
-x or +x | [-] + means trim/screen by hard masking (Xs) instead of soft (lowercase) |
-s {char} | [b] v = vector only, q = quality only, b = both |
-f or +f | [+] create sequence file for processed reads |
-q or +q | [+] create quality file for processed reads |
-o or +o | [+] include all reads (not just passed) in output files |
-t or +t | [-] trim by deleting bases outside windows instead of masking |
-V {string} | [] path of file to read vector sequence from (for screening first 50 bases) |
-d {DebugString} | [v] Each character indicates a debugging option to turn on; '+' indicates turn on all debugging. 'v' gives some statistics 'V' gives lots of statistics |
-v or +v | Print RCS version number |
-h | print usage and option descriptions |
For example,
atlas-screen-window +t -q -d 'Vo' foo.fa bar.fa
atlas-splitbadcontigs -a AceIn -q ReadQualityIn -o AceOut {options}
A moving average of the template coverage is computed starting at the ends and moving inward. The window size for this moving average is given by trimWindow and is expressed in base pairs. The trim cutoff is the first place where the moving average of the template coverage is greater than trimHeight. The contig is split into two contigs around this 'template crash'. Reads that overlap the left or right boundary of the crash region go to the left or right contig if they are anchored there by a matched mate; otherwise they are simply dropped.
When a ProblemFile is given with the -p option, a different and more subtle set of split rules is used that is not based solely on template crashes.
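The moving-average scan described above can be sketched as follows (an illustrative reimplementation, not the program's code); trim_window and trim_height correspond to the -w and -c options, and the coverage profile is made up:

```python
# Sketch of the template-crash scan: find the first window, moving inward
# from one end, whose average template coverage exceeds the threshold.
def trim_cutoff(template_cov, trim_window, trim_height):
    """template_cov: per-base template coverage values.
    Returns the start index of the first passing window, or None."""
    for i in range(len(template_cov) - trim_window + 1):
        window = template_cov[i:i + trim_window]
        if sum(window) / trim_window > trim_height:
            return i          # everything before this window would be trimmed
    return None               # no window passes

# Toy profile: a low-coverage tail followed by solid coverage.
cov = [0, 0, 1, 1, 4, 5, 5, 5]
cutoff = trim_cutoff(cov, trim_window=2, trim_height=2)
```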
-a | AceFile |
-q | Read qualities (for new consensus computation). |
-o | Ace output file name. |
-r | TrimmedReadListOut |
-w | trimWindow (for default template crash mode, i.e. not used with -p) |
-c | trimCoverage (for default template crash mode, i.e. not used with -p) |
-g | Graph output which includes Rep_node: specifiers. When given, only nodes listed in the Rep_node: list will be trimmed; others will be left alone. This option excludes -p. With this option, splits occur on template crashes only. |
-p | Problem file. This is a file that identifies regions of suspicious patterns of external links. These are probably problems, so if at all possible, a split will be done on each of these problems. This option excludes -g. With this option, a series of rules are applied to determine if the suggested problem area is real and, if so, how to split. |
-M | Minimum size of zero template coverage contig to keep after splitting. |
atlas-trimphraptails -a AceIn -q ReadQualities -o AceOut {options}
A moving average of the template coverage is computed starting at the ends and going inward. The window size for this moving average is given by trimWindow, and is expressed in base pairs. The trim cutoff is specified as the first place where the moving average of the template coverage is greater than trimHeight.
-a | AceFile |
-q | Read qualities (for new consensus computation). |
-o | Ace output file name. |
-r | TrimmedReadListOut |
-w | trimWindow |
-c | trimCoverage |
-g | Graph output which includes Rep_node: specifiers. When given, only nodes listed in the Rep_node: list will be trimmed; others will be left alone. Unlike previous versions, starting with version 1.1 a 'node' is taken to be just one side of a contig. So if Contig63_right is the only rep node, only the right half of Contig63 will be trimmed. |
-M | Minimum size of trimmed contig to keep. If either (or both) of contig and trim portion of contig are smaller than this, they are dropped. |
atlas-refsequence -a AceFile -g ScaffoldGraph -o Output {Options}
Reads in an ace file and an associated graph file and produces a reference sequence. Sequence is written out in the relative order and orientation given by the graph file. Gaps between ordered and oriented sequences are padded with Ns. In single-sequence mode, gaps between unordered sequences are padded with Xs:
> Scaffold0
sssssNNNNNsssssNNNNssssXXXXXsssssNNNNsssssXXXssssXXXssssXXX
In multiple-sequence mode, each scaffold is one sequence, like:
>Scaffold0
sssssNNNNNsssssNNNNssss
>Scaffold1
sssssNNNNsssss
>UnscaffoldedContig2
ssss
>UnscaffoldedContig3
ssss
Output is written to a file whose name starts with ERROR if the sequence ends up with blocks of more than 50K Ns.
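The single-sequence padding rule above can be sketched as follows (illustrative only; the contig sequences, gap sizes, and fixed X-run length here are assumptions, with the defaults of -X and -N as stand-ins):

```python
# Sketch of single-sequence padding: ordered gaps are padded with Ns sized by
# the gap estimate (with a minimum), unordered joins with a fixed run of Xs.
def pad_scaffold(contigs, gaps, min_n=50, x_run=100):
    """contigs: list of contig sequences; gaps: estimated gap size between
    consecutive contigs, with None meaning unordered (X padding)."""
    parts = [contigs[0]]
    for seq, gap in zip(contigs[1:], gaps):
        pad = "X" * x_run if gap is None else "N" * max(gap, min_n)
        parts.append(pad + seq)
    return "".join(parts)

# Two ordered contigs 60 bp apart, then an unordered contig:
scaffold = pad_scaffold(["AAAA", "CCCC", "GGGG"], [60, None])
```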
-a | Ace file name. |
-g | Scaffold .graph file name. |
-o | Output file name. |
-p | Project name to append to _Scaffold1, etc. |
-q | Also write quality file out to Output.qual |
-S | Enable single sequence mode. Default is multiple sequence. |
-m | Multiple *file* mode. Each scaffold is written to its own file. |
-F | Use fixed number of Ns to pad between scaffolds rather than based on distance estimate. |
-s | (15000) Minimum size of scaffold to output. |
-U | Include unscaffolded contigs (that meet criteria for unscaffolded contigs) |
-M | Minimum size for BAC only unscaffolded contigs. |
-Q | (1) Quality trim for contigs. |
-X | (100) Number of X's to pad between unordered sequences. |
-N | (50) Minimum number of Ns to pad between ordered seqs. |
-G | MaxGapSize to allow in output. Reports an error and writes output to .FAIL file if gap greater than this in output. Default is disabled. |
-O | MaxOutputSize. Reports an error and writes output to a .FAIL file if the total sequence size exceeds this amount. Default is disabled. |
-D | Show debugging output (Contig and gap structure printed). |
-I | Allows just one fasta input file to be specified. |
-b | FOF listing the bin file names. Union with command-line bin1, bin2, etc. to get the full set of bin files. |
Extracts the reads specified in a bin file (or set of bin files) from a set of indexed FASTA files (the indices are created by atlas-createindex). Takes a list of the FASTA files indexed (NOT the index names; the original FASTA file names) and a list of bin file names. For each FASTA file it tries to infer the corresponding quality file name and, if that file exists, adds it to the list of files to be extracted. Some of the transformations it will try include:
read1.fa | -> read1.qual |
read1.fasta | -> read1.qual |
read1.fa.orig | -> read1.fa.qual (where orig can be anything). |
read1.fasta.orig | -> read1.fa.qual |
fasta.read1.wgs | -> qual.read1.wgs |
fa.read1.wgs | -> qual.read1.wgs |
read1 | -> read1.qual |
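The transformations in the table above amount to a small set of suffix/prefix rules. A sketch (not the program's actual logic, which also checks that each candidate file exists on disk):

```python
import os

# Quality-file name guesses mirroring the transformation table above.
def quality_candidates(fasta_name):
    """Return the candidate quality file name(s) for a FASTA file name."""
    base = os.path.basename(fasta_name)
    cands = []
    if base.endswith(".fa"):
        cands.append(base[:-3] + ".qual")          # read1.fa -> read1.qual
    elif base.endswith(".fasta"):
        cands.append(base[:-6] + ".qual")          # read1.fasta -> read1.qual
    elif ".fa." in base:
        cands.append(base.split(".fa.")[0] + ".fa.qual")     # read1.fa.orig
    elif ".fasta." in base:
        cands.append(base.split(".fasta.")[0] + ".fa.qual")  # read1.fasta.orig
    elif base.startswith("fasta."):
        cands.append("qual." + base[len("fasta."):])  # fasta.read1.wgs
    elif base.startswith("fa."):
        cands.append("qual." + base[len("fa."):])     # fa.read1.wgs
    else:
        cands.append(base + ".qual")               # read1 -> read1.qual
    return cands
```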
The output file names for each bin are binfilename_out.fa and binfilename_out.fa.qual. If no rootpath is given, the output is written to the current directory.
NOTE: These output files are written in append mode!
Written by: James Durbin kdurbin@bcm.tmc.edu
Original Perl version by: Ray Chen ruichen@bcm.tmc.edu