Atlas-Indel2 for Illumina Exome Capture Data
version 0.1, December 2010

AUTHORS: Danny Challis, Uday Evani, Fuli Yu,
         Human Genome Sequencing Center, Baylor College of Medicine
CONTACT: challis@bcm.edu, fyu@bcm.edu

Atlas-Indel2 is a suite of indel calling tools based on logistic regression models built from pertinent variables of the sequencing data. This branch of Atlas-Indel2 is trained for Illumina Exome Capture data.

SYSTEM REQUIREMENTS:
* Unix-like operating system
* Ruby 1.9.1: http://www.ruby-lang.org/en/downloads/
* SAMtools must be installed and runnable by invoking the "samtools" command
  - SAMtools may be obtained at http://samtools.sourceforge.net/

LICENSE:
This software is free for all uses with the following restrictions:
* No part of this software, or modifications thereof, may be redistributed for any purpose to any other company, person, or individual, without prior written permission from the author.
* This software is provided AS IS. Baylor College of Medicine assumes no responsibility or liability for damages of any kind that may result, directly or indirectly, from the use of this software.
* The above copyright notice must be preserved in the executable About dialog or made visible to users in some other way.

DATA PRE/POST PROCESSING REQUIREMENTS:
We highly recommend local realignment around indels and high variation regions using GATK or other third party tools. While you may run Atlas-Indel2 without local realignment, it is designed to work with locally realigned data, and much greater sensitivity is possible with realignment. Details on local realignment using GATK can be found at
http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels

The BAM files must be sorted in the same order as the reference genome.

Currently Atlas-Indel2 does not filter results to the capture target region. You may use tools such as VCFtools (http://vcftools.sourceforge.net) to filter your results to the regions listed in a .bed file.
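
The sort-order requirement above can be checked before a run. The following is a minimal sketch, not part of Atlas-Indel2: all three helper names are hypothetical, and it assumes you have already captured the BAM header text (for example with "samtools view -H") and the ">" header lines of the reference FASTA.

```ruby
# Hypothetical helpers, not taken from the Atlas-Indel2 source.

# Extract sequence names from the "@SQ ... SN:<name>" lines of a SAM/BAM header.
def bam_sequence_names(header_text)
  header_text.each_line
             .select { |line| line.start_with?("@SQ") }
             .map { |line| line[/\tSN:([^\t\n]+)/, 1] }
             .compact
end

# Extract sequence names from FASTA header lines (text after ">" up to the
# first whitespace).
def fasta_sequence_names(fasta_header_lines)
  fasta_header_lines.map { |line| line.sub(/\A>/, "").split(/\s+/).first }
end

# True when every BAM sequence exists in the reference and the BAM sequences
# appear in the same relative order as in the reference.
def same_sequence_order?(bam_names, ref_names)
  position = {}
  ref_names.each_with_index { |name, i| position[name] = i }
  return false unless bam_names.all? { |name| position.key?(name) }
  bam_names.map { |name| position[name] }.each_cons(2).all? { |a, b| a < b }
end
```

If same_sequence_order? returns false, re-sort the BAM file or reorder the reference before running Atlas-Indel2.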
Example pipeline:
* map fastq files using BWA to get BAM files
* sort BAM files
* locally realign BAMs using GATK local realigner
* run Atlas-Indel2 on BAMs to get individual VCF files
* merge VCF files using vcfPrinter (included)
* filter VCF file to target region using VCFtools and .bed file

USAGE:
Atlas-Indel2-Illum-Exome.rb -b input_bam -r reference_sequence -o outfile

optional arguments:
-s --sample [by default taken from infile name] (the name of the sample)
-z --z-cutoff [-1] (the z-score cutoff for the regression model)
-t --min-total-depth [4] (may not be lower than 4)
-m --min-var-reads [1]
-v --min-var-ratio [0.05]
-f --strand-dir-filter (requires at least one read in each direction, extremely limits sensitivity)

This usage information can be viewed at any time by running the program without arguments.

Mandatory arguments:

-b, --bam=FILE
    The input BAM file. It must be sorted. It does not need to be indexed. A read mask of 1796 is applied to the bitwise flag, so reads flagged as unmapped, not primary, QC-failed, or duplicate are skipped.

-r, --reference=FILE
    The reference sequence to be used, in FASTA format. This must be the same reference version that was used for mapping. Its chromosomes must also be sorted in the same order as the input BAM file. It does not need to be indexed.

-o, --outfile=FILENAME
    The name of the output VCF file. Output is a bare-bones VCFv4 file with a single sample. These files can be merged into a more complete VCF file using vcfPrinter (included). If the file already exists, it will be overwritten.
    NOTE: For use with vcfPrinter, you should name your VCF file the same as your BAM file, simply replacing ".bam" with ".vcf".

Optional arguments:

-s, --sample=STRING
    The name of the sample to be listed in the output VCF file. If not specified, the sample name is taken from the input BAM file name, using the characters before the first . (dot). For example, with the filename "NA12275.chrom1.bam" the sample name would be "NA12275".

-z, --z-cutoff=FLOAT
    Default=-1.0
    The z-score cutoff value for the logistic regression model. Indels with a z-score less than this cutoff will not be called. Increasing this cutoff will increase specificity, but will lower sensitivity.
    Suggested range: -0.5 to -4.0

Heuristic Cutoffs:
Most of these variables have already been considered by the regression model, so you shouldn't usually need to alter them. However you are free to change them to meet your specific project requirements.

-t, --min-total-depth=INT
    Default=4
    The minimum total depth coverage required at an indel site. Indels at a site with less depth coverage will not be called. This cutoff may not be set lower than 4. Increasing this value will increase specificity, but lower sensitivity.
    Suggested range: 4-12

-m, --min-var-reads=INT
    Default=1
    The minimum number of variant reads required for an indel to be called. Increasing this number may increase specificity but will lower sensitivity.
    Suggested range: 1-5

-v, --min-var-ratio=FLOAT
    Default=0.05
    The variant-reads/total-reads cutoff. Indels with a ratio less than the specified value will not be called. Increasing this value may increase specificity, but will significantly lower sensitivity.
    Suggested range: 0-0.15

-f, --strand-dir-filter
    Default=disabled
    When included, requires indels to have at least one variant read in each strand direction. This filter is effective at increasing the specificity, but also carries a heavy sensitivity cost.

EXAMPLES:
ruby Atlas-Indel2-Illum-Exome.rb -b NA12275.chrom1.bam -r ~/refs/human_g1k_v37.fasta -o ~/NA12275.chrom1.vcf -z -2

ruby Atlas-Indel2-Illum-Exome.rb -b seq1.10.2010.chrom1.bam -r ~/refs/human_g1k_v37.fasta -o ~/NA12275.chrom1.vcf -t 10 -m 2 -s NA12275
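
The second example passes -s because the default sample name would otherwise come from the BAM file name. The naming rules described for --sample and --outfile can be sketched as follows. This is an illustration only: both helper names are hypothetical, and the code is not taken from the Atlas-Indel2 source.

```ruby
# Hypothetical helpers illustrating the naming rules in this README.

# Default sample name: the characters of the BAM file name before the first dot.
def default_sample_name(bam_path)
  File.basename(bam_path).split(".").first
end

# VCF name suggested for use with vcfPrinter: the BAM name with ".bam"
# replaced by ".vcf".
def vcf_name_for(bam_path)
  bam_path.sub(/\.bam\z/, ".vcf")
end

puts default_sample_name("NA12275.chrom1.bam")       # NA12275
puts default_sample_name("seq1.10.2010.chrom1.bam")  # seq1
puts vcf_name_for("NA12275.chrom1.bam")              # NA12275.chrom1.vcf
```

With a file named "seq1.10.2010.chrom1.bam" the derived sample name would be "seq1", which is why that example sets -s NA12275 explicitly.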