vcf_four_gamete.py: Four Gamete Test Function

The four-gamete test is a method for determining whether there has been recombination between a pair of variants. It requires phased data, so that each individual's haplotypes can be defined by the alleles carried at the two sites.

../../_images/PPP_FGT.png

In this illustration of the four-gamete test, the haplotypes of the samples from 197337 to 199256 (highlighted in green) pass the four-gamete test. In comparison, the haplotypes from 196944 to 197337 and from 199256 to 199492 (highlighted in red) both fail the four-gamete test, as all four possible haplotypes are observed.
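The pairwise check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by vcf_four_gamete.py: the function name `passes_four_gamete` and the toy allele lists are hypothetical, and the sites are assumed to be biallelic and phased, with one entry per sampled chromosome.

```python
def passes_four_gamete(site_a, site_b):
    """Return True if fewer than four two-site haplotypes are observed.

    site_a and site_b hold phased alleles (0 or 1), one entry per
    sampled chromosome, in the same chromosome order.
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must cover the same chromosomes")
    # Each (allele at a, allele at b) pair is one observed haplotype.
    gametes = set(zip(site_a, site_b))
    return len(gametes) < 4


# Toy data (hypothetical): six phased chromosomes at three sites.
site_1 = [0, 0, 1, 1, 0, 1]
site_2 = [0, 0, 1, 1, 1, 1]  # with site_1: haplotypes 00, 11, 01
site_3 = [0, 1, 0, 1, 0, 0]  # with site_1: haplotypes 00, 01, 10, 11

print(passes_four_gamete(site_1, site_2))  # True
print(passes_four_gamete(site_1, site_3))  # False
```

Only three distinct haplotypes appear for the first pair, so it is compatible with no recombination; the second pair shows all four and therefore fails.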

Given phased input with individual variants over a region of the genome, four_gamete generates an interval within those variants that passes the four-gamete filtering criteria, then returns either that interval or an output file with the variants in that interval.
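As a rough picture of what "an interval that passes the four-gamete criteria" means, the sketch below scans sites from left to right and reports each maximal run of consecutive sites in which every pair passes the test. It is an assumption-laden illustration rather than the script's actual algorithm: the names `compatible_intervals` and `passes_four_gamete`, the toy positions, and the allele data are all hypothetical, and informative-site counting and missing data are ignored.

```python
def passes_four_gamete(site_a, site_b):
    """True if fewer than four two-site haplotypes are observed."""
    return len(set(zip(site_a, site_b))) < 4


def compatible_intervals(positions, sites):
    """Return maximal (start_pos, end_pos) runs with no failing pair.

    positions: sorted variant positions.
    sites: one list of phased 0/1 alleles per position.
    """
    intervals = []
    left = 0  # leftmost index still compatible with every later site seen
    for right in range(len(sites)):
        new_left = left
        # Find the closest earlier site in the window that conflicts.
        for k in range(right - 1, left - 1, -1):
            if not passes_four_gamete(sites[k], sites[right]):
                new_left = k + 1
                break
        if new_left > left:
            # The window had to shrink, so [left, right - 1] was maximal.
            intervals.append((positions[left], positions[right - 1]))
        left = new_left
    if sites:
        intervals.append((positions[left], positions[-1]))
    return intervals


# Toy data (hypothetical): four sites, six phased chromosomes.
positions = [100, 200, 300, 400]
sites = [
    [0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 0],  # conflicts with the site at 100
    [0, 1, 0, 1, 0, 0],  # compatible with the site at 200
    [0, 0, 1, 1, 0, 1],  # conflicts with the site at 300
]
print(compatible_intervals(positions, sites))
# [(100, 100), (200, 300), (400, 400)]
```

In this toy case only the sites at 200 and 300 form a multi-site interval, mirroring the green region in the illustration.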

Common usage for this function is to input a VCF file that contains variants for individuals at a single locus; the output is a VCF that contains a subset of these variants. A full VCF can be used with --vcfreg, where the second argument is a BED file with one or more regions. In that case, the output is either a VCF for each four-gamete passing region or a new BED file with the truncated regions.

Input Arguments

--vcfs <input_vcf_1> ... <input_vcf_n>
Input name of one or more VCF files, where each VCF represents a locus.
--vcfreg <input_vcf> <BED file>
Input name of VCF file containing genome data and name of BED file with regions to be analyzed.

Output Arguments

--out <output_filename>
Name for output file.
--out-prefix <output_prefix>
If multiple files are output, this option is required to set a prefix for the output files.

Interval Arguments

--numinf <minimum informative site count>
Region returned must have at least n informative sites; defaults to 1.
--hk
If set, returns intervals with at least one recombination event instead of regions with no recombination.
--reti
The script generates a list of valid regions with no recombination. Selecting this option returns a single interval, chosen as specified by other arguments.
--retl
Returns all valid intervals, either as a list of intervals or multiple output files

Single Returned Region Arguments

Select one of:

--rani
Returns random interval (default)
--ranb
Returns random interval, with probability of interval proportional to interval length
--left
Return first interval with enough informative sites
--right
Return last interval with enough informative sites
--maxlen
Return interval with most informative sites
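The five selection modes above can be mimicked with a small Python helper. This is only a sketch of the described behavior, not the script's implementation: `select_interval`, the `(start, end, n_informative)` tuple layout, and the toy intervals are hypothetical, and "interval length" for the --ranb mode is assumed to mean length in base pairs.

```python
import random


def select_interval(intervals, mode="rani", numinf=1, rng=None):
    """Pick one interval from a position-sorted list.

    intervals: list of (start, end, n_informative) tuples.
    mode: 'rani', 'ranb', 'left', 'right' or 'maxlen'.
    numinf: minimum informative sites an interval must contain.
    """
    rng = rng or random.Random()
    candidates = [iv for iv in intervals if iv[2] >= numinf]
    if not candidates:
        raise ValueError("no interval has enough informative sites")
    if mode == "rani":
        # Every qualifying interval is equally likely.
        return rng.choice(candidates)
    if mode == "ranb":
        # Probability proportional to length (assumed: base pairs).
        weights = [end - start + 1 for start, end, _ in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    if mode == "left":
        return candidates[0]
    if mode == "right":
        return candidates[-1]
    if mode == "maxlen":
        # Interval with the most informative sites.
        return max(candidates, key=lambda iv: iv[2])
    raise ValueError(f"unknown mode: {mode}")


# Toy intervals (hypothetical): (start, end, informative site count).
toy_intervals = [(100, 150, 1), (200, 900, 3), (1000, 1040, 2)]
print(select_interval(toy_intervals, "left", numinf=2))    # (200, 900, 3)
print(select_interval(toy_intervals, "right", numinf=2))   # (1000, 1040, 2)
print(select_interval(toy_intervals, "maxlen"))            # (200, 900, 3)
```

Filtering by the informative-site threshold before selecting keeps "first" and "last" consistent with the descriptions of --left and --right.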

Other Arguments

--remove-multiallele
Removes multi-allelic sites from analysis
--include-missing
Include sites with missing data in analysis
--ovlps
Extend region to include non-informative variants between an edge variant and a variant that breaks the four-gamete criteria
--ovlpi
Include informative variants from overlapping regions