Atlas-Indel2 for Indel Calling in Whole Exome Capture Sequencing Data
version 0.2, Feb 2011
AUTHORS: Danny Challis, Uday Evani, Fuli Yu, Human Genome Sequencing Center, Baylor College of Medicine
CONTACT: challis@bcm.edu, fyu@bcm.edu

Atlas-Indel2 is a suite of indel calling tools based on logistic regression models 
that use pertinent variables from the sequencing data.  This version of Atlas-Indel2 includes
regression models for Illumina and SOLiD exome data.


SYSTEM REQUIREMENTS:

    * Unix-like operating system
    * Ruby 1.9.1: http://www.ruby-lang.org/en/downloads/
    * SAMtools must be installed and runnable by invoking the "samtools" command
      -SAMtools may be obtained at http://samtools.sourceforge.net/
    

LICENSE:

This software is free for all uses with the following restrictions:
    * No part of this software, or modifications thereof, may be redistributed for 
	any purpose to any other company, person, or individual, without prior written 
	permission from the author.
    * This software is provided AS IS. Baylor College of Medicine assumes no 
	responsibility or liability for damages of any kind that may result, directly 
	or indirectly, from the use of this software.
    * The above copyright notice must be preserved in the executable About dialog or 
	made visible to users in some other way.


DATA PRE/POST PROCESSING REQUIREMENTS: 
We highly recommend local realignment around indels and high variation regions using GATK or 
other third party tools.  While you may run Atlas-Indel2 without local realignment, greater 
sensitivity is possible with it. Details on local realignment using GATK can be found at 
http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels.

Currently Atlas-Indel2 does not filter results to the capture target region.  You may use 
tools such as VCFtools (http://vcftools.sourceforge.net) to filter your results to a .bed file.

example pipeline:
	*map fastq files using BWA to get BAM files
	*sort BAM files
	*locally realign BAMs using GATK local realigner
	*run Atlas-Indel2 on BAMs to get individual VCF files
	*merge VCF files using vcfPrinter (included)
	*filter VCF file to target region using VCFtools and .bed file


USAGE:
	ruby Atlas-Indel2.rb -b input_bam -r reference_sequence -o outfile [-S or -I]
		-S --solid-exome-model (use the model trained for SOLiD exome data)
		-I --illumina-exome-model (use the model trained for Illumina exome data)

		optional arguments:
		-s --sample [by default taken from infile name] (the name of the sample)
		-z --z-cutoff (the z-score cutoff for the regression model)
		-t --min-total-depth (may not be lower than 4 for Illumina data)
		-m --min-var-reads 
		-v --min-var-ratio
		-f --strand-dir-filter (requires at least one read in each direction, 
					extremely limits sensitivity)

This usage information can be viewed at any time by running the program without arguments.

Mandatory arguments:

-b, --bam=FILE
	The input BAM file.  It must be sorted. It does not need to be
	indexed.  A read mask of 1796 is used on the bitwise flag.
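
To illustrate the read mask, here is a minimal sketch that decodes 1796 into
its component SAM flag bits, with meanings as defined in the SAM specification.
The helper name masked? is hypothetical, and the sketch assumes that a read is
skipped whenever any masked bit is set; it is not taken from the Atlas-Indel2
source.

```ruby
# SAM flag bits that add up to the read mask of 1796
# (bit meanings are from the SAM specification).
MASKED_FLAGS = {
  0x4   => 'read unmapped',
  0x100 => 'secondary (not primary) alignment',
  0x200 => 'read fails quality checks',
  0x400 => 'PCR or optical duplicate'
}.freeze

READ_MASK = MASKED_FLAGS.keys.inject(:+) # 4 + 256 + 512 + 1024 = 1796

# Hypothetical helper: true when a read with this FLAG value has any
# masked bit set (assumed here to mean the read is skipped).
def masked?(flag)
  (flag & READ_MASK) != 0
end

puts READ_MASK       # 1796
puts masked?(99)     # false (paired, proper pair, mate reverse, first in pair)
puts masked?(1123)   # true  (same read with the duplicate bit set)
```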

-r, --reference=FILE
	The reference sequence to be used in FASTA format.  This must be the same
	version used in mapping the sequence. It does not need to be indexed.

-o, --outfile=FILENAME
	The name of the output VCF file.  Output is a bare-bones VCFv4 file with
	a single sample.  These files can be merged into a more complete VCF
	file using the vcfPrinter (included).  If the file already exists, it will
	be overwritten. NOTE: For use with vcfPrinter, you should name your vcf file
	the same as your BAM file, simply replacing ".bam" with ".vcf"
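
The naming convention in the note above can be sketched as follows. The
helper name vcf_name_for is hypothetical; it only shows the ".bam" to ".vcf"
substitution and is not part of Atlas-Indel2 or vcfPrinter.

```ruby
# Hypothetical helper: derive the output VCF name from the BAM name by
# replacing a trailing ".bam" with ".vcf". Names without a ".bam" suffix
# are returned unchanged.
def vcf_name_for(bam_path)
  bam_path.sub(/\.bam\z/, '.vcf')
end

puts vcf_name_for('NA12275.chrom1.bam') # NA12275.chrom1.vcf
```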

-I or -S
	You must include one of these flags to specify either the Illumina or SOLiD
	regression model to be used.

Optional arguments:
Note: Different platform models have different defaults.

-s, --sample=STRING 
	The name of the sample to be listed in the output VCF file.  If not specified
	the sample name is harvested from the input BAM file name, taking the
	first group of characters before a . (dot) is found.  For example, with
	the filename "NA12275.chrom1.bam" the sample name would be "NA12275".
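
The default sample-name rule can be sketched as below. The helper name is
hypothetical, and stripping the directory part with File.basename is an
assumption; the text above only states that the characters before the first
dot are used.

```ruby
# Hypothetical helper: take the characters before the first dot in the
# BAM file name. Dropping the directory part first is an assumption.
def sample_name_from_bam(path)
  File.basename(path).split('.').first
end

puts sample_name_from_bam('NA12275.chrom1.bam')      # NA12275
puts sample_name_from_bam('seq1.10.2010.chrom1.bam') # seq1
```

The second call shows why a file such as "seq1.10.2010.chrom1.bam" needs an
explicit -s value if the sample should be listed as "NA12275".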

-z, --z-cutoff=FLOAT
	Defaults: Illumina:-1.0, SOLiD:0.5
	The z-score cutoff value for the logistic regression model.  Indels with a 
	z-score less than this cutoff will not be called.  Increasing this cutoff will
	increase specificity, but will lower sensitivity.
	Illumina Suggested range: -4.0 to -0.5
	SOLiD Suggested range: -2 to 1

Heuristic Cutoffs:
Most of these variables have already been considered by the regression model, so you
shouldn't usually need to alter them.  However, you are free to change them to meet 
your specific project requirements.

-t, --min-total-depth=INT
	Defaults: Illumina:4, SOLiD:2
	The minimum total depth coverage required at an indel site.  Indels at a
	site with less depth coverage will not be called.  This cutoff may not
	be set lower than 4 with the Illumina model.  Increasing this value will
	increase specificity, but lower sensitivity.
	Suggested range: 2-12

-m, --min-var-reads=INT
	Defaults: Illumina:1, SOLiD:2
	The minimum number of variant reads required for an indel to be called.
	Increasing this number may increase specificity but will lower sensitivity.
	Suggested range: 1-5

-v, --min-var-ratio=FLOAT
	Defaults: Illumina:0.1, SOLiD:0.07
	The variant-reads/total-reads cutoff.  Indels with a ratio less than the
	specified value will not be called.  Increasing this value may increase
	specificity, but will lower sensitivity. 
	Suggested range: 0-0.15


-f, --strand-dir-filter
	Default: Illumina:disabled, SOLiD:disabled
	When included, requires indels to have at least one variant read in each
	strand direction.  This filter is effective at increasing the
	specificity, but also carries a heavy sensitivity cost.
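
The cutoffs described above can be read as a chain of independent filters.
The following is a minimal sketch of that reading, using the documented
defaults.  The Candidate struct, its field names, and the passes? helper are
hypothetical, and treating total depth as the denominator of the variant
ratio is an assumption; the actual Atlas-Indel2 implementation may differ.

```ruby
# Documented per-platform defaults.
DEFAULTS = {
  illumina: { z_cutoff: -1.0, min_total_depth: 4, min_var_reads: 1,
              min_var_ratio: 0.1, strand_dir_filter: false },
  solid:    { z_cutoff: 0.5, min_total_depth: 2, min_var_reads: 2,
              min_var_ratio: 0.07, strand_dir_filter: false }
}.freeze

# Hypothetical summary of one candidate indel site.
Candidate = Struct.new(:z, :total_depth, :fwd_var_reads, :rev_var_reads)

# Hypothetical helper: true when the candidate survives every cutoff.
def passes?(c, opts)
  var_reads = c.fwd_var_reads + c.rev_var_reads
  return false if c.z < opts[:z_cutoff]
  return false if c.total_depth < opts[:min_total_depth]
  return false if var_reads < opts[:min_var_reads]
  return false if var_reads.to_f / c.total_depth < opts[:min_var_ratio]
  if opts[:strand_dir_filter]
    # -f: require at least one variant read in each strand direction
    return false if c.fwd_var_reads.zero? || c.rev_var_reads.zero?
  end
  true
end

site = Candidate.new(-0.5, 10, 2, 0)
puts passes?(site, DEFAULTS[:illumina])                                 # true
puts passes?(site, DEFAULTS[:solid])                                    # false
puts passes?(site, DEFAULTS[:illumina].merge(strand_dir_filter: true))  # false
```

In this example the site fails the SOLiD defaults on the z-score cutoff, and
fails the strand direction filter because all variant reads are on one strand.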


EXAMPLES:
ruby Atlas-Indel2.rb -b NA12275.chrom1.bam -r ~/refs/human_g1k_v37.fasta -o ~/NA12275.chrom1.vcf -z -2 -I
ruby Atlas-Indel2.rb -b seq1.10.2010.chrom1.bam -r ~/refs/human_g1k_v37.fasta -o ~/NA12275.chrom1.vcf -t 10 -m 5 -s NA12275 -S
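
When running many samples, command lines like the ones above can be built
programmatically.  The sketch below uses only flags documented in this README
and checks two documented constraints before building the command: a model
must be selected, and -t may not be lower than 4 with the Illumina model.  The
helper name atlas_indel2_command and its argument layout are hypothetical.

```ruby
require 'shellwords'

# Hypothetical helper: build an Atlas-Indel2 argument list.
# platform must be :illumina or :solid; extra may hold :z_cutoff and
# :min_total_depth.
def atlas_indel2_command(bam, reference, outfile, platform, extra = {})
  model_flag = { illumina: '-I', solid: '-S' }.fetch(platform) do
    raise ArgumentError, 'platform must be :illumina or :solid'
  end

  depth = extra[:min_total_depth]
  if platform == :illumina && depth && depth < 4
    raise ArgumentError,
          '--min-total-depth may not be lower than 4 with the Illumina model'
  end

  args = ['ruby', 'Atlas-Indel2.rb', '-b', bam, '-r', reference,
          '-o', outfile, model_flag]
  args += ['-t', depth.to_s] if depth
  args += ['-z', extra[:z_cutoff].to_s] if extra[:z_cutoff]
  args
end

cmd = atlas_indel2_command('NA12275.chrom1.bam', 'refs/human_g1k_v37.fasta',
                           'NA12275.chrom1.vcf', :illumina, z_cutoff: -2)
puts Shellwords.join(cmd)
```

The resulting array can be passed to system(*cmd) to run the tool without
going through a shell.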


CHANGES:
* Implemented regression model for SOLiD data.  You must now specify a regression model (-S or -I).
* Renamed main script to Atlas-Indel2.rb.
* Modified Reference sequence class to allow for unsorted reference genomes.
* Added the indel z-score to the info column of the VCF output (not included after running vcfPrinter).
* Now echoes all settings back to the command line.
* Fixed a bug that caused loss of precision in the normalized variant square variable of the Illumina site model.
* Fixed a bug in the depth coverage algorithm that caused reads not to be counted in total depth at the deleted sites.
* Fixed the sample column order to be compatible with vcfPrinter.
* Removed "x flagged lines skipped" message at end of run.