Atlas Whole Genome Assembly Suite

Atlas is a collection of software tools to facilitate the assembly of large genomes from whole genome shotgun reads, or from a combination of whole genome shotgun reads and BAC or other localized reads. This suite of tools has been used in the assemblies of the rat (Rattus norvegicus), fruit fly (D. pseudoobscura), honey bee (Apis mellifera), sea urchin (Strongylocentrotus purpuratus), bovine (Bos taurus), and other species.

This is a new release of the Atlas software. Some components of the software used to produce the rat assembly are not yet packaged for distribution; these will be added shortly. While we anticipate no problems running this software on different platforms, it has not yet been extensively tested off-site. The older 2004 version of the Atlas software distribution is still available for the convenience of those who licensed that version. If you did not download it before, please do not download it now.

What we have in the package

Documents

Binaries

Binaries are available for Linux and Sun Solaris Unix. A Mac OS X version will be available soon. They are packed in separate gzipped tar files, atlas-linux.tgz and atlas-solaris.tgz.

atlas-overlapper
Compares reads in a query file and a target file for overlaps. The FASTA sequences should be of roughly read length (up to 1000 bp) and should be trimmed for sequence and quality. The overlap output is in graph form. Run atlas-overlapper -help for option information.

atlas-binner
Takes an overlap graph and produces bins of read names according to mutual overlaps. Run atlas-binner -help for option information.

atlas-count-kmers
Counts all subsequences of length k for a set of reads, giving the repeat frequency for each of them and the frequency distribution of distinct k-mers. See the notes below for details.

atlas-screen-window
Reads in two FASTA sequence files (masked and unmasked) and a quality file for a set of reads to be trimmed. Looks for the first and last passing windows and masks any leading or trailing junk. Run atlas-screen-window -help for option information.

atlas-splitbadcontigs
Identifies misjoins within contigs and splits them. Options are listed in the notes below.

atlas-trimphraptails
Removes reads from contig ends where they conflict with scaffolding and lack a read-pair mate inside the contig. Options are listed in the notes below.

atlas-linearsequence
Reads in an ace file and an associated graph file and produces sequences. For details, see the notes below.

atlas-extractbins
Extracts reads specified in a bin file (or set of bin files) from a set of indexed FASTA files. For details, see the help message in the notes below.

Perl scripts

atlas-prep-reads
Prepares reads for the Atlas assembly; a wrapper. It makes several system calls to various binary executables and perl scripts, and also uses the Atlas::PrepReads perl module, to do the following: call atlas-createindex to create indices for the sequence and quality files of the original reads; call atlas-divide-fafile to divide large files into sub-files, if necessary, so that atlas-screen-trim-file can run; call atlas-screen-trim-file to create the trimmed reads file; and call atlas-count-kmers and then analyze the distribution (find the cutoff repeat frequency, choosing cutoff = (2 or 3) * coverage, and create a kill-file for atlas-overlapper).

atlas-createindex
Creates index files for sequence (FASTA) and quality files.

atlas-divide-fafile
Divides large files (more than 20,000 reads per FASTA file) into sub-files if necessary, in order to run atlas-screen-trim-file.
atlas-screen-trim-file
Calls cross_match and atlas-screen-window to create the trimmed reads file (scanning in from each end of a read, looking for 50-base windows of high quality and no vector).

atlas-asm-wgs
Runs the Atlas assembly for WGS reads as prepared by atlas-prep-reads. This is the second wrapper in the package. It does the following: run atlas-overlapper for all reads against all; run atlas-binner to group reads into bins for localized assembly; call atlas-separate-bin-assemble to separate reads from the overall file of binned reads into bins, put them in a directory tree, and run a local phrap assembly for each bin; and use atlas-build-scaffold-file and the perl module Atlas::AsmWgs to combine and scaffold the bin assemblies.

atlas-separate-bin-assemble
Separates reads from the overall file of binned reads into bins and puts them in a directory tree, extracts reads from the WGS reads pool (using atlas-extractbins on the indices created by atlas-createindex), screens for vectors (using cross_match), and runs a phrap assembly for each bin.

atlas-build-scaffold-file
Builds scaffolds based on an ace file.

You need to edit the above four perl scripts according to where your perl binary is installed on your system. See the next section, A Quick Tour, for details.

Perl modules (in the Atlas/ directory)

Atlas/PrepReads.pm
Atlas/AsmWgs.pm
Atlas/Scaffold.pm
Atlas/ScaffoldHeapEle.pm
Atlas/Project/Contig.pm
Atlas/Project/Trace.pm
Atlas/Utility/ObjectAttribute.pm

These provide subroutines and variables for atlas-prep-reads, atlas-asm-wgs, and atlas-build-scaffold-file.

Reads used for demonstration

demo.fa, demo.fa.qual
Sequences taken from a BAC, and some WGS sequences, used in the demonstration of an Atlas run.

univec.fa
A non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process. It was developed by staff at the National Center for Biotechnology Information, part of the National Library of Medicine at the National Institutes of Health. A detailed description of UniVec is available on the web: http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html

A Quick Tour

After downloading the file atlas-linux.tgz, atlas-solaris.tgz, or atlas-osx.tgz (available soon) for the correct operating system, move it to the directory where you want to install the Atlas assembly package, then unpack it by typing the command

tar xvzf atlas2005-linux.tgz
or
tar xvzf atlas2005-solaris.tgz

You will get the following subdirectories and files:

documents/
documents/readme.html
documents/graphics/atlaslogo.gif
bin/
bin/atlas-overlapper
bin/atlas-splitbadcontigs
bin/atlas-screen-window
bin/atlas-binner
bin/atlas-trimphraptails
bin/atlas-linearsequence
bin/atlas-count-kmers
bin/atlas-extractbins
local/
perl/
perl/bin/
perl/bin/atlas-asm-wgs
perl/bin/atlas-build-scaffold-file
perl/bin/atlas-createindex
perl/bin/atlas-divide-fafile
perl/bin/atlas-prep-reads
perl/bin/atlas-screen-trim-file
perl/bin/atlas-separate-bin-assemble
perl/lib/
perl/lib/Atlas/
perl/lib/Atlas/AsmWgs.pm
perl/lib/Atlas/PrepReads.pm
perl/lib/Atlas/Scaffold.pm
perl/lib/Atlas/ScaffoldHeapEle.pm
perl/lib/Atlas/Utility/
perl/lib/Atlas/Utility/ObjectAttribute.pm
perl/lib/Atlas/Project/
perl/lib/Atlas/Project/Trace.pm
perl/lib/Atlas/Project/Contig.pm
data/
data/univec/
data/univec/README.uv
data/univec/univec.fa
data/demo/
data/demo/demo.001.fa
data/demo/demo.001.fa.qual

We assume that the perl program on your system is /usr/bin/perl. If it is not, you need to edit the first line (#!/usr/bin/perl) accordingly for all perl scripts in the perl/bin/ subdirectory.
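For example, a minimal sketch assuming your perl actually lives at /usr/local/bin/perl (an example path only; substitute whatever "which perl" reports on your system):

    which perl
    cd perl/bin
    # rewrite the #! line of every Atlas perl script in place
    perl -i -pe 's{^#!/usr/bin/perl}{#!/usr/local/bin/perl}' atlas-*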
Set the environment variable ATLAS_ROOT. This should be the directory where you have unpacked and installed the Atlas package. In the bash shell, use the command

export ATLAS_ROOT=xxxx

You may want to put this line in the .bashrc file in your home directory. In the C shell, use the command

setenv ATLAS_ROOT xxxx

and the equivalent file is .cshrc.

The scaffolder uses the perl modules Heap and Heap::Fibonacci, which are not part of the standard perl library. Check to make sure you have them on your system. If not, you can get them from CPAN (the Comprehensive Perl Archive Network). You can install them either in the default perl library directory on your system or in the ATLAS_ROOT/perl/lib directory. In the latter case, they are perl/lib/Heap.pm and perl/lib/Heap/Fibonacci.pm.

We use cross_match and phrap as the screener and assembler. You need to make a symbolic (soft) link from local/phrap to the phrap executable on your computer:

cd local
ln -s <path-to-phrap> phrap

where <path-to-phrap> is where phrap is installed on your computer; for example, at the HGSC it is /home/hgsc/bin/phrap. Do the same for cross_match. If you do not have phrap and/or cross_match, please see the webpage http://www.phrap.org for information about how to get a copy.

Now you are ready to test-run the Atlas assembly with the data we supplied.

Run atlas-prep-reads

First you need to decide where to put the reads data; this directory is referred to below as <reads_dir> (any name). Under this directory, create a subdirectory original/ (no other name is allowed). Then copy $ATLAS_ROOT/data/demo/* to <reads_dir>/original. Change to the perl/bin directory, and then run the command

atlas-prep-reads -r <reads_dir>

where <reads_dir> is the directory where you will put your original and trimmed data.

The results: the data files we supplied are copied to the <reads_dir>/original directory, where they are indexed. The trimmed files of the reads data are in the directory <reads_dir>/screened. The 32-mer analysis is done under <reads_dir>/screened/kmers, but the resulting kill-file is moved to <reads_dir>/screened.

Run atlas-asm-wgs

Choose a directory for your assembly, <assembly_dir> (any name), and run

atlas-asm-wgs -r <reads_dir> -a <assembly_dir>

You will get the following results in <assembly_dir>:

atlas.graph: result of atlas-overlapper
atlas.fon: result of atlas-binner, a list of reads bundled into bins
local assemblies for each bin, located in a directory tree
atlas.ace, atlas.contigs, atlas.contigs.qual: combined ace and contig files for all bin-assembly results
atlas.scaffold: result of scaffolding
atlas.linear.fa, atlas.linear.fa.qual, atlas.linear.fa.scaffold: linearized scaffold files

For the ace file format, see http://www.phrap.org. In the graph file, output is written in the form

SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} QueryName2...

For the format of scaffold files, see the notes for atlas-linearsequence. For details, please refer to the paper by Havlak et al.

Reference

Paul Havlak, Rui Chen, K. James Durbin, Amy Egan, Yanru Ren, Xing-Zhi Song, George M. Weinstock and Richard A. Gibbs. The Atlas Genome Assembly System. Genome Res. 14: 721-732 (2004).

Notes

atlas-overlapper -help

atlas-overlapper -s {SourceFile} -q {QueryFile} -o {OutFile} [{heuristic options}] {query2 query3 query4 ....}

IMPORTANT LIMITATIONS:
Query read sequences cannot be longer than 4095 bp (2^12).
There cannot be more than 524,000 source reads (2^19).
UNPREDICTABLE behavior will result if these limits are exceeded.

Uses a filtration technique to compare all of the sequences in a FASTA file to each other, identifying all of the overlaps between the sequences.
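As a concrete illustration of the usage line above (the detailed algorithm description and option list follow below), a hypothetical all-against-all run over a single trimmed read file might look like this; the file names are placeholders, and -O and -K are heuristic options documented below:

    atlas-overlapper -s reads.trimmed.fa -q reads.trimmed.fa -o atlas.graph -O 32 -K kmers.kill

where kmers.kill would be the k-mer kill-file produced during read preparation.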
Small samples are taken from each read and compared with the full sequence of all other reads using a fast exact-match algorithm (a hash table in the current version; suffix trees have been used in other versions). Repeats are filtered out, de novo, in order to produce physical overlaps only. A banded alignment of all pairs that contain a common low-copy n-mer is performed to determine the true overlap amounts to be reported in a graph file.

The output will be in the form:

SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} QueryName2...

A single query file can be specified with the -q option (for backward compatibility), but multiple query files can also be added onto the command line, with no arbitrary limit.

-s {SourceFile} Name of source FASTA file. The hash will be built from the source.
-q {QueryFile} Name of query FASTA file. Query reads will be sampled and the samples compared to the source hash.
-Q {QueryFOF} A file which lists the query files to use. Can be used in conjunction with -q or with a list of queries as free arguments. (The purpose of this option is to allow very long query lists.)
-o {OutFile} Name of output file. This will be an adjacency-list graph of the overlaps for each sequence in the source file.
-v Print version and version history.

------ Misc -----------------------------
-b {Begin with} Number of the sequence in the source to start with (0-based).
-e {End with} Number of the sequence in the source to end with.
The -b and -e options are for building the hash from a large FASTA file (e.g. > 50,000 reads), where the file has to be compared in chunks. Since the program scans through the source until it reaches read b, this is not as efficient as creating several separate source files and running the program several separate times, but it is often more convenient.
-Z gzip the output graph. Default is not gzipped. Output graphs can get quite large, especially now that multiple values are being reported for each edge. This option will save the graph output directly to a gzip file. NOTE: it does NOT automatically tack .gz onto the file name.
-F Output format: flip the sense of overlap extensions so they are relative to the read that is the origin of the edge (default is relative to the sink of the edge).
-f Output format: flip the direction of edges to go from source read to query read (default is from query to source).

----- Heuristics ------------------------------
-P Prime-numbered kill hash size. A good value will be at least 1.2 times the number of kill mers you have. An integer value selects one of the following (5 is the default):
0 357913931 Roughly 3 GB of space.
1 101000777 Roughly 900 MB of space.
2 79999987 Roughly 800 MB of space.
3 59999999 687 MB
4 49999991 429 MB
5 39999943 340 MB
6 29999947 258 MB
-p Directly specify a prime-numbered kill hash size (overrides -P).
-O {oligo size} Size of k-mer to use for seeding banded alignments (4-32). Allows the sample length to be specified. Default 32 bp.
-B {band size} Band size for banded alignments (default = 2, which is +/- 2 from the diagonal). Valid range: 1-25.
-R {cutoff mers} This is the repeat cutoff specified as a number of mers, rather than as a number of standard deviations above the mean. Overrides the -r option. The default is to use -r.
-r {RepeatThresh} Cutoff for repeats. Samples occurring more than mean + r*sigma times are ignored as repeat samples. This cutoff attempts to be a function of the actual coverage, since the mean number of occurrences of a random sample will be close to the mean coverage, skewed a bit high by the presence of repeats.
For low-coverage data (C
-K {RepeatFileName} When specified, a file of known repeat mers is read in and used to filter repeats. The file lists a mer, a tab, and the mer count on each line. K is for "kill list".
-k {MinRepeatCount} When reading in the -K repeat file, save in RAM only those mers with count >= MinRepeatCount.
-Y {MaxOverlapSeed} Max seed. The largest mer count that will be used to seed an overlap. This is analogous to the -R option, except that it is based on the mer frequencies reported in the RepeatFile (-K), which are globally derived. This option works before the -R option, so the -R stats will be computed on the mers that remain.
-H {HashMerLimit} This is another form of repeat cutoff. The purpose of this limit is to constrain the size of the hash. Basically, HashMerLimit is a hard limit on the number of locations that will be saved for any mer. It should be set to something just slightly larger than what you expect mean + r*sigma above to be. 100 or 1000 are pretty safe values in most cases. The smaller this number, the more reads can be put into each hash, and so the quicker a set of jobs will go. The default value is 1000.
-m {MaxMismatch%} Maximum mismatch percent to accept in the aligned region. Edges which reflect more than this fraction of mismatches will not be reported. This is translated into a score cutoff as follows. Since each match is +1 and each mismatch is -1 (or -2 for an indel, which we ignore here), a mismatch carries a double penalty so far as the score is concerned (no +1 for the match, plus a -1 penalty). So, for a particular span, the score that corresponds to a mismatch fraction of x percent is score = (1-(x/100))*span - (x/100)*span = (1-(2x/100))*span. So the cutoff is simply cutoff = (1-(2x/100))*span. If the score falls below this cutoff, the edge is not reported.
-M {Min Read Size} Minimum size of reads to consider. Default is 100 bp.
-S {SamplesPerRead} Samples per read. The code tries to take this many samples, but a read may be too short, in which case it takes as many as it can.
-I {SampleInterior} Sample across the entire read. Default is end sampling only, where SamplesPerRead/2 samples are taken from each end of the read. End sampling typically gives you more sensitivity per sample, since we are looking for end-to-end matches, but situations arise where it is necessary to look at the interior of reads also.
-G {SampleGap} Step between samples. Default is the same as the sample size (-O), for perfectly tiled samples (barring Ns).

--- Experimental options -----
-x {BucketSize} Set the initial hash bucket size.

Written by: James Durbin (kdurbin@bcm.tmc.edu)
Version 1.70

atlas-binner -help

Reads in an asymmetric directed graph, with each possible source node on a separate line, followed by its whitespace-delimited out-edges. Each out-edge is represented as a pair, nodename whitespace weight, where the weight is between 1 and 99.

Output is one file of read names (one read per line) per bin. Core reads of a bin occur in only one bin, but peripheral reads may occur in multiple bins. Bins are named prefixNNNNsuffix, where prefix0000suffix is the default bin for reads not placed elsewhere.

[Defaults are given in square braces.] [$LastChangedRevision: 5855 $]

Option values are:
-f {GraphFile} [] Name of graph file (empty => cin).
-p {BinPrefix} [bin] Prefix for bin file names.
-s {BinSuffix} [.fon] Suffix for bin file names.
+A [-] Boolean, put all sinks of core edges in one bin if true.
-i {Ignore} [35] Integer, minimum below which edge weight is ignored.
-c {Coverage} [6] Floating-point, estimated coverage of input graph file.
This implies -l (3 * coverage) and -r (2 * coverage) unless alternative values for those options are given.
-l {FrontierLmt} [-1] Integer, maximum size of the frontier in BFS.
-L {FrontierLmt} [-1] Integer. If this is different from -l, all values between -l and -L will be used.
-r {RepeatEdge} [-1] Integer, max edges before a node is treated as a repeat. This is used in BAC-fishing mode only.
-k {KillRepeat} [12] Integer, kill overlap edges seeded by an n-mer with more than this many copies.
+S [-] Boolean, consider extensions relative to the sink (otherwise relative to the origin).
+E [-] Boolean, count left and right neighbors separately against the RepeatEdge count.
-n {EstNodes} [262144] Integer, estimated maximum number of nodes, for efficiency's sake.
-m {MinBinSize} [10] Integer, minimum number of nodes per output bin.
-x {MaxBinSize} [99999] Integer, maximum number of nodes per output bin.
-e {MaxBinDist} [10000] Integer, maximum distance from the start of the bin, in edges.
+w [+] Boolean, use estimated-exact-span weighting.
-d {DebugString} [] Each character indicates a debugging option to turn on; '+' indicates turn on all debugging.

atlas-count-batch

atlas-count-batch K HASHSIZE SLICING SLICE MINREPORT

Arguments are:
K k-mer size; in our demonstration K=32.
HASHSIZE A large prime for the hash table size: (genomesize*1.2)/slicing factor.
SLICING Hash slicing factor: 1/SLICING of the k-mers will be tabulated, so SLICING jobs will be needed to get the complete k-mer table.
SLICE Which job this is of the several jobs (0, ..., SLICING-1) needed to get the complete table.
MINREPORT Print k-mers with frequencies >= MINREPORT; print only the distribution if MINREPORT=0.

atlas-screen-window -help

Reads in two FASTA sequence files (masked and unmasked) and a quality file for a set of reads to be trimmed. Also requires a file of vector sequences to be aggressively trimmed (any 10 bp match) at the beginnings of reads. Looks for the first and last passing windows and masks any leading or trailing junk. Use with '-s v' to screen for phrap without quality trimming.

Boolean options are specified with + to turn on, - to turn off (e.g., -x).

Default option values are:
-O {string} [] suffix to append to arg for opening original (unscreened) sequence file
-S {string} [] suffix to append to arg for opening screened sequence file
-Q {string} [.qual] suffix to append to arg for opening quality file
-p {phredscore} [20] minimum phred quality score for good bases
-l {length} [100] minimum number of good bases in insert for a passed read
-w {length} [50] trim read from the first passing window of this size to the last
-P {length} [50] look extra hard for vector in this many bases at the front of the read
-e {error_rate} [1.25] maximum expected error rate for a passing window
-k or +k [+] trim to window bases with qual >= 20
-x or +x [-] + means trim/screen by hard masking (Xs) instead of soft (lowercase)
-s {char} [b] v = vector only, q = quality only, b = both
-f or +f [+] create sequence file for processed reads
-q or +q [+] create quality file for processed reads
-o or +o [+] include all reads (not just passed) in output files
-t or +t [-] trim by deleting bases outside windows instead of masking
-V {string} [] path of file to read vector sequence from (for screening first 50 bases)
-P {length} [50] look extra hard for vector in this many bases at the front of the read
-d {DebugString} [v] Each character indicates a debugging option to turn on; '+' indicates turn on all debugging.
'v' gives some statistics; 'V' gives lots of statistics.
-v or +v Print RCS version number
-h print usage and option descriptions

For example:

atlas-screen-window +t -q -d 'Vo' foo.fa bar.fa

atlas-splitbadcontigs -help

atlas-splitbadcontigs -a AceIn -q ReadQualityIn -o AceOut {options}

A moving average of the template coverage is computed starting at the ends and going inward. The window size for this moving average is given by trimWindow and is expressed in base pairs. The trim cutoff is specified as the first place where the moving average of the template coverage is greater than trimHeight. The contig is split into two contigs around this 'template crash'. Reads which overlap the left and right boundary of the crash region go to the left or right if they are anchored there by a matched mate, but otherwise are simply dropped.

When a ProblemFile is given with the -p option, a totally different set of more subtle split rules is used, not based solely on template crashes.

-a AceFile
-q Read qualities (for new consensus computation).
-o Ace output file name.
----------------
-r TrimmedReadListOut
-w trimWindow (for the default template-crash mode, i.e. not used with -p)
-c trimCoverage (for the default template-crash mode, i.e. not used with -p)
-g Graph output which includes Rep_node: specifiers. When given, only nodes listed in the Rep_node: list will be trimmed; others will be left alone. This option excludes -p. With this option, splits are on template crashes only.
-p Problem file. This is a file that identifies regions with suspicious patterns of external links. These are probably problems, so if at all possible, a split will be done on each of these problems. This option excludes -g. With this option, a series of rules is applied to determine whether the suggested problem area is real and, if so, how to split.
-M Minimum size of zero-template-coverage contig to keep after splitting.

atlas-trimphraptails -help

atlas-trimphraptails -a AceIn -q ReadQualities -o AceOut {options}

A moving average of the template coverage is computed starting at the ends and going inward. The window size for this moving average is given by trimWindow and is expressed in base pairs. The trim cutoff is specified as the first place where the moving average of the template coverage is greater than trimHeight.

-a AceFile
-q Read qualities (for new consensus computation).
-o Ace output file name.
----------------
-r TrimmedReadListOut
-w trimWindow
-c trimCoverage
-g Graph output which includes Rep_node: specifiers. When given, only nodes listed in the Rep_node: list will be trimmed; others will be left alone. Unlike previous versions, starting with version 1.1 a 'node' is taken to be just one side of a contig. So if Contig63_right is the only rep node, only the right half of Contig63 will be trimmed.
-M Minimum size of trimmed contig to keep. If either (or both) of the contig and the trimmed portion of the contig are smaller than this, they are dropped.

atlas-linearsequence -help

atlas-refsequence -a AceFile -g ScaffoldGraph -o Output {Options}

Reads in an ace file and an associated graph file and produces a reference sequence. Sequence is written out in the relative order and orientation given by the graph file. Gaps between ordered and oriented sequences are padded with Ns.
In single-sequence mode, gaps between unordered sequences are padded with Xs:

>Scaffold0
sssssNNNNNsssssNNNNssssXXXXXsssssNNNNsssssXXXssssXXXssssXXX

In multiple-sequence mode, each scaffold is one sequence, like:

>Scaffold0
sssssNNNNNsssssNNNNssss
>Scaffold1
sssssNNNNsssss
>UnscaffoldedContig2
ssss
>UnscaffoldedContig3
ssss

Output is to a file that starts with ERROR if the sequence ends up with blocks of > 50K Ns.

-a Ace file name.
-g Scaffold .graph file name.
-o Output file name.
-p Project name to append to _Scaffold1, etc.
-q Also write a quality file out to Output.qual.
-S Enable single-sequence mode. Default is multiple-sequence.
-m Multiple *file* mode. Each scaffold is written to its own file.
-F Use a fixed number of Ns to pad between scaffolds, rather than basing it on the distance estimate.
-s (15000) Minimum size of scaffold to output.
-U Include unscaffolded contigs (that meet the criteria for unscaffolded contigs).
-M Minimum size for BAC-only unscaffolded contigs.
-Q (1) Quality trim for contigs.
-X (100) Number of Xs to pad between unordered sequences.
-N (50) Minimum number of Ns to pad between ordered sequences.
-G MaxGapSize to allow in the output. Reports an error and writes output to a .FAIL file if a gap greater than this appears in the output. Default is disabled.
-O MaxOutputSize. Reports an error and writes output to a .FAIL file if the total sequence size exceeds this amount. Default is disabled.
-D Show debugging output (contig and gap structure printed).

atlas-extractbins -help

atlas-extractbins -i fastafileslist bin1 bin2 ...

-I Allows just one FASTA input file to be specified.
-b FOF listing the bin file names. Taken as a union with the command-line bin1, bin2, etc. to get the full set of bins to extract.

Extracts reads specified in a bin file (or set of bin files) from a set of indexed FASTA files (the indices are created by atlas-createindex). Takes a list of the FASTA files that were indexed (NOT the index names, the original FASTA file names) and a list of bin file names. It tries to decipher the corresponding quality file name and looks to see whether that file exists; if it does, it is added to the list of files to be extracted. Some of the transformations it will try include:

read1.fa -> read1.qual
read1.fasta -> read1.qual
read1.fa.orig -> read1.fa.qual (where orig can be anything)
read1.fasta.orig -> read1.fa.qual
fasta.read1.wgs -> qual.read1.wgs
fa.read1.wgs -> qual.read1.wgs
read1 -> read1.qual

The output file name for each bin is binfilename_out.fa and binfilename_out.fa.qual. If no root path is given, the output will be in the current directory. NOTE: these output files are written in append mode!

Written by: James Durbin (kdurbin@bcm.tmc.edu)
Original Perl version by: Ray Chen (ruichen@bcm.tmc.edu)
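For illustration, a hypothetical invocation (the file names are placeholders) that extracts the reads of two bins from a set of previously indexed FASTA files might look like

    atlas-extractbins -i reads.fofn bin0001.fon bin0002.fon

where reads.fofn lists the original FASTA file names (indexed beforehand with atlas-createindex); the extracted sequence and quality data for each bin would then be appended to the binfilename_out.fa and binfilename_out.fa.qual files described above.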