Atlas-Indel2 for Indel Calling in Whole Exome Capture Sequencing Data
version 0.2, Feb 2011

AUTHORS: Danny Challis, Uday Evani, Fuli Yu,
Human Genome Sequencing Center, Baylor College of Medicine
CONTACT: challis@bcm.edu, fyu@bcm.edu

Atlas-Indel2 is a suite of indel calling tools based on logistic regression
models built from pertinent variables of the sequencing data. This version of
Atlas-Indel includes regression models for Illumina and SOLiD exome data.

SYSTEM REQUIREMENTS:
* Unix-like operating system
* Ruby 1.9.1: http://www.ruby-lang.org/en/downloads/
* SAMtools must be installed and runnable by invoking the "samtools" command
  -SAMtools may be obtained at http://samtools.sourceforge.net/

LICENSE:
This software is free for all uses with the following restrictions:
* No part of this software, or modifications thereof, may be redistributed
  for any purpose to any other company, person, or individual, without prior
  written permission from the author.
* This software is provided AS IS. Baylor College of Medicine assumes no
  responsibility or liability for damages of any kind that may result,
  directly or indirectly, from the use of this software.
* The above copyright notice must be preserved in the executable About dialog
  or made visible to users in some other way.

DATA PRE/POST PROCESSING REQUIREMENTS:
We highly recommend local realignment around indels and high variation
regions using GATK or other third party tools. Atlas-Indel2 will run without
local realignment, but realigned data gives greater sensitivity. Details on
local realignment using GATK can be found at
http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels.

Currently Atlas-Indel2 does not filter results to the capture target region.
You may use tools such as VCFtools (http://vcftools.sourceforge.net) to
restrict your results to the regions listed in a .bed file.

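To make the target-region filtering step concrete, here is a minimal Ruby
sketch of what it does. It is not VCFtools and not part of Atlas-Indel2; the
function names parse_bed and filter_vcf_to_regions are hypothetical. It
assumes standard BED coordinates (0-based, half-open) and VCF positions
(1-based), and it tests only the reported POS of each record, not the full
span of a deletion.

```ruby
# Parse BED text into { chrom => [[start, stop], ...] }.
# BED intervals are 0-based and half-open: [start, stop).
def parse_bed(text)
  regions = Hash.new { |hash, chrom| hash[chrom] = [] }
  text.each_line do |line|
    next if line.strip.empty? || line.start_with?('#', 'track', 'browser')
    chrom, start, stop = line.split
    regions[chrom] << [start.to_i, stop.to_i]
  end
  regions
end

# Keep all header lines, and keep a record only if its POS falls in a region.
# VCF POS is 1-based, so it is shifted by one before comparing with BED.
def filter_vcf_to_regions(vcf_text, regions)
  kept = vcf_text.each_line.select do |line|
    next true if line.start_with?('#')
    chrom, pos = line.split("\t", 3)
    pos0 = pos.to_i - 1
    regions.fetch(chrom, []).any? { |start, stop| start <= pos0 && pos0 < stop }
  end
  kept.join
end
```

A linear scan over the regions is enough for illustration; for a whole exome
target file a dedicated tool such as VCFtools is the better choice.
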
example pipeline:
* map fastq files using BWA to get BAM files
* sort BAM files
* locally realign BAMs using GATK local realigner
* run Atlas-Indel2 on BAMs to get individual VCF files
* merge VCF files using vcfPrinter (included)
* filter VCF file to target region using VCFtools and .bed file

USAGE:
ruby Atlas-Indel2.rb -b input_bam -r reference_sequence -o outfile [-S or -I]

-S --solid-exome-model (use the model trained for SOLiD exome data)
-I --illumina-exome-model (use the model trained for Illumina exome data)

optional arguments:
-s --sample [by default taken from infile name] (the name of the sample)
-z --z-cutoff (the z-score cutoff for the regression model)
-t --min-total-depth (may not be lower than 4 for Illumina data)
-m --min-var-reads
-v --min-var-ratio
-f --strand-dir-filter (requires at least one read in each direction, severely limits sensitivity)

This usage information can be viewed at any time by running the program
without arguments.

Mandatory arguments:

-b, --bam=FILE
    The input BAM file. It must be sorted. It does not need to be indexed.
    A read mask of 1796 is used on the bitwise flag, i.e. reads flagged as
    unmapped, secondary alignment, QC-failed, or duplicate are skipped.

-r, --reference=FILE
    The reference sequence to be used, in FASTA format. This must be the same
    version used in mapping the sequence. It does not need to be indexed.

-o, --outfile=FILENAME
    The name of the output VCF file. Output is a bare-bones VCFv4 file with a
    single sample. These files can be merged into a more complete VCF file
    using the vcfPrinter (included). If the file already exists, it will be
    overwritten.
    NOTE: For use with vcfPrinter, you should name your VCF file the same as
    your BAM file, simply replacing ".bam" with ".vcf"

-I or -S
    You must include one of these flags to specify either the Illumina or
    SOLiD regression model to be used.

Optional arguments:
Note: Different platform models have different defaults.

-s, --sample=STRING
    The name of the sample to be listed in the output VCF file.
    If not specified, the sample name is harvested from the input BAM file
    name, taking the first group of characters before a . (dot) is found.
    For example, with the filename "NA12275.chrom1.bam" the sample name would
    be "NA12275".

-z, --z-cutoff=FLOAT
    Defaults: Illumina:-1.0, SOLiD:0.5
    The z-score cutoff value for the logistic regression model. Indels with a
    z-score less than this cutoff will not be called. Increasing this cutoff
    will increase specificity, but will lower sensitivity.
    Illumina Suggested range: -4.0 to -0.5
    SOLiD Suggested range: -2 to 1

Heuristic Cutoffs:
Most of these variables have already been considered by the regression model,
so you shouldn't usually need to alter them. However you are free to change
them to meet your specific project requirements.

-t, --min-total-depth=INT
    Defaults: Illumina:4, SOLiD:2
    The minimum total depth coverage required at an indel site. Indels at a
    site with less depth coverage will not be called. This cutoff may not be
    set lower than 4 with the Illumina model. Increasing this value will
    increase specificity, but lower sensitivity.
    Suggested range: 2-12

-m, --min-var-reads=INT
    Defaults: Illumina:1, SOLiD:2
    The minimum number of variant reads required for an indel to be called.
    Increasing this number may increase specificity but will lower
    sensitivity.
    Suggested range: 1-5

-v, --min-var-ratio=FLOAT
    Defaults: Illumina:0.1, SOLiD:0.07
    The variant-reads/total-reads cutoff. Indels with a ratio less than the
    specified value will not be called. Increasing this value may increase
    specificity, but will lower sensitivity.
    Suggested range: 0-0.15

-f, --strand-dir-filter
    Default: Illumina:disabled, SOLiD:disabled
    When included, requires indels to have at least one variant read in each
    strand direction. This filter is effective at increasing the specificity,
    but also carries a heavy sensitivity cost.

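The optional arguments above can be summarized in a short Ruby sketch. This
is an illustration of the documented behavior, not the code inside
Atlas-Indel2.rb: the names DEFAULTS, IndelSite, sample_name_from_bam,
resolve_options and call_indel? are hypothetical, and the assumption that
cutoffs are applied as simple "reject if below" tests is taken only from the
descriptions above. The default values are the ones listed for each option.

```ruby
# Defaults copied from the option descriptions; the hash layout is illustrative.
DEFAULTS = {
  illumina: { z_cutoff: -1.0, min_total_depth: 4, min_var_reads: 1,
              min_var_ratio: 0.1, strand_dir_filter: false },
  solid:    { z_cutoff: 0.5, min_total_depth: 2, min_var_reads: 2,
              min_var_ratio: 0.07, strand_dir_filter: false }
}.freeze

# Hypothetical per-site summary; these field names are not from Atlas-Indel2.
IndelSite = Struct.new(:z_score, :total_depth, :fwd_var_reads, :rev_var_reads) do
  def var_reads
    fwd_var_reads + rev_var_reads
  end
end

# Default sample name: the characters before the first dot of the BAM file name.
# Stripping the directory part first is an assumption of this sketch.
def sample_name_from_bam(bam_path)
  File.basename(bam_path).split('.').first
end

# Merge user overrides into the platform defaults and enforce the documented
# lower bound on -t for the Illumina model.
def resolve_options(platform, overrides = {})
  opts = DEFAULTS.fetch(platform).merge(overrides)
  if platform == :illumina && opts[:min_total_depth] < 4
    raise ArgumentError, 'min-total-depth may not be lower than 4 for Illumina'
  end
  opts
end

# Apply the z-score cutoff and the heuristic cutoffs to one candidate site.
def call_indel?(site, opts)
  return false if site.z_score < opts[:z_cutoff]
  return false if site.total_depth < opts[:min_total_depth]
  return false if site.var_reads < opts[:min_var_reads]
  return false if site.var_reads.to_f / site.total_depth < opts[:min_var_ratio]
  if opts[:strand_dir_filter]
    # -f: require at least one variant read on each strand.
    return false if site.fwd_var_reads.zero? || site.rev_var_reads.zero?
  end
  true
end
```

For example, sample_name_from_bam("NA12275.chrom1.bam") returns "NA12275",
matching the case described under -s. A site with z-score -0.5, depth 10 and
two forward-strand variant reads passes the Illumina defaults, fails the
SOLiD defaults (z-score below 0.5), and fails under either model once the
strand direction filter is enabled.
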
EXAMPLES:
ruby Atlas-Indel2.rb -b NA12275.chrom1.bam -r ~/refs/human_g1k_v37.fasta -o ~/NA12275.chrom1.vcf -z -2 -I

ruby Atlas-Indel2.rb -b seq1.10.2010.chrom1.bam -r ~/refs/human_g1k_v37.fasta -o ~/NA12275.chrom1.vcf -t 10 -m 5 -s NA12275 -S

CHANGES:
* Implemented regression model for SOLiD data. You must now specify a
  regression model (-S or -I).
* Renamed main script to Atlas-Indel.rb.
* Modified Reference sequence class to allow for unsorted reference genomes.
* Added the indel z-score to the info column of the VCF output (not included
  after running vcfPrinter).
* Now echoes all settings back onto the command line.
* Fixed a bug that caused loss of precision in the normalized variant square
  variable of the Illumina site model.
* Fixed a bug in the depth coverage algorithm that caused reads not to be
  counted in total depth at the deleted sites.
* Fixed the sample columns order to be compatible with vcfPrinter.
* Removed "x flagged lines skipped" message at end of run.