vcf_four_gamete.py: Four Gamete Test Function
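The four-gamete test is a method for determining whether recombination has occurred between a pair of variants. It requires phased data, so that each individual's haplotypes can be defined by the alleles they carry at the two sites. For two biallelic sites there are four possible haplotypes; if all four are observed, the pair fails the test, which is taken as evidence of recombination between the sites under the assumption that each site has mutated only once.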
In this illustration of the four-gamete test, the haplotypes of the samples from 197337 to 199256 (highlighted in green) pass the four-gamete test. In comparison, the haplotypes from 196944 to 197337 and from 199256 to 199492 (highlighted in red) both fail the four-gamete test, because all possible haplotypes are observed.
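The pairwise test described above can be sketched in a few lines of Python. This is only an illustration, not the code used by vcf_four_gamete.py: the function name passes_four_gamete and the example alleles are hypothetical, and the sketch assumes biallelic sites with no missing data.

```python
from typing import Sequence


def passes_four_gamete(site_a: Sequence[str], site_b: Sequence[str]) -> bool:
    """Return True if fewer than four distinct two-site haplotypes are observed.

    Assumes both sites are biallelic, phased, and free of missing data.
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must cover the same set of haplotypes")
    # Pair up the alleles carried by each haplotype at the two sites.
    observed = set(zip(site_a, site_b))
    return len(observed) < 4


# Each list position is one phased haplotype (made-up example data).
site_1 = ["0", "0", "1", "1", "0", "1"]
site_2 = ["0", "1", "1", "1", "0", "1"]
site_3 = ["0", "1", "0", "1", "0", "1"]

print(passes_four_gamete(site_1, site_2))  # True: only 00, 01 and 11 occur
print(passes_four_gamete(site_1, site_3))  # False: 00, 01, 10 and 11 all occur
```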
Given phased input with individual variants over a region of the genome, four_gamete generates an interval within those variants that passes the four-gamete filtering criteria, then returns either that interval or an output file containing the variants in that interval.
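One way to picture interval generation is the following sketch, which scans consecutive sites and reports every maximal run in which all pairs of sites pass the test. It is an assumption about how such intervals could be found, not a description of the script's actual algorithm; the names compatible and maximal_compatible_intervals and the example data are hypothetical, and the sketch again assumes biallelic sites with no missing data.

```python
from typing import List, Sequence, Tuple


def compatible(site_a: Sequence[str], site_b: Sequence[str]) -> bool:
    """Two sites are compatible if fewer than four haplotypes are observed."""
    return len(set(zip(site_a, site_b))) < 4


def maximal_compatible_intervals(
    sites: Sequence[Sequence[str]],
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of maximal runs of
    consecutive sites in which every pair of sites is compatible."""
    intervals = []
    n = len(sites)
    prev_end = -1
    for start in range(n):
        end = start
        # Extend to the right while the next site is compatible with
        # every site already in the run.
        while end + 1 < n and all(
            compatible(sites[i], sites[end + 1]) for i in range(start, end + 1)
        ):
            end += 1
        # A run that ends no further than the previous one is contained
        # in it, so only keep runs that reach further right.
        if end > prev_end:
            intervals.append((start, end))
            prev_end = end
    return intervals


# Four sites (rows) typed in four phased haplotypes (made-up data).
sites = [
    ["0", "0", "1", "1"],
    ["0", "1", "1", "1"],
    ["0", "1", "0", "1"],
    ["1", "1", "0", "0"],
]

print(maximal_compatible_intervals(sites))  # [(0, 1), (1, 2), (3, 3)]
```

The returned pairs are indices into the list of sites; mapping them back to genomic positions is left out of the sketch.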
A common use of this function is to input a VCF file that contains variants for individuals at a single locus; the output is a VCF that contains a subset of these variants. A full VCF can be used with --vcfreg, where the second argument is a BED file with one or more regions. In that case, the output is either a VCF for the regions that pass the four-gamete test or a new BED file with the truncated regions.
Input Arguments
- --vcfs <input_vcf_1> ... <input_vcf_n>
- Input name of one or more VCF files, where each VCF represents a locus.
- --vcfreg <input_vcf> <BED file>
- Input name of VCF file containing genome data and name of BED file with regions to be analyzed.
Output Arguments
- --out <output_filename>
- Name for output file.
- --out-prefix <output_prefix>
- If multiple files are output, this option is required to set a prefix for the output files.
Interval Arguments
- --numinf <minimum informative site count>
- The returned region must have at least n informative sites (default: 1).
- --hk
- If set, returns intervals with at least one recombination event instead of regions with no recombination.
- --reti
- This script will generate a list of valid regions with no recombination. Selecting this option will return a single interval as specified by other arguments
- --retl
- Returns all valid intervals, either as a list of intervals or multiple output files
Single Returned Region Arguments
Select one of:
- --rani
- Returns random interval (default)
- --ranb
- Returns random interval, with probability of interval proportional to interval length
- --left
- Return first interval with enough informative sites
- --right
- Return last interval with enough informative sites
- --maxlen
- Return interval with most informative sites
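The selection options above can be sketched as follows. This is a hedged illustration only: the Interval tuple, the pick_interval function, and the example intervals are hypothetical, intervals are assumed to be sorted by position, and "interval length" for --ranb is assumed to mean the span in base pairs, which the script may define differently.

```python
import random
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    start: int          # first position in the interval
    end: int            # last position in the interval (inclusive)
    n_informative: int  # number of informative sites in the interval


def pick_interval(
    intervals: List[Interval],
    mode: str = "rani",
    numinf: int = 1,
    rng: Optional[random.Random] = None,
) -> Optional[Interval]:
    """Pick one interval, mimicking the documented selection options."""
    rng = rng or random.Random()
    # Keep only intervals with enough informative sites (as with --numinf).
    valid = [iv for iv in intervals if iv.n_informative >= numinf]
    if not valid:
        return None
    if mode == "rani":    # uniform random choice
        return rng.choice(valid)
    if mode == "ranb":    # probability proportional to length (assumed bp span)
        weights = [iv.end - iv.start + 1 for iv in valid]
        return rng.choices(valid, weights=weights, k=1)[0]
    if mode == "left":    # first valid interval
        return valid[0]
    if mode == "right":   # last valid interval
        return valid[-1]
    if mode == "maxlen":  # interval with the most informative sites
        return max(valid, key=lambda iv: iv.n_informative)
    raise ValueError(f"unknown mode: {mode}")


# Made-up intervals, sorted by start position.
intervals = [
    Interval(100, 500, 3),
    Interval(450, 2000, 1),
    Interval(1800, 2600, 5),
]

print(pick_interval(intervals, "left", numinf=2).start)    # 100
print(pick_interval(intervals, "right", numinf=2).start)   # 1800
print(pick_interval(intervals, "maxlen").n_informative)    # 5
```

Passing a seeded random.Random instance as rng makes the two random modes reproducible.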
Other Arguments
- --remove-multiallele
- Removes multiallelic sites from analysis
- --include-missing
- Include sites with missing data in analysis
- --ovlps
- Extend region to include non-informative variants between an edge variant and a variant that breaks the four-gamete criteria
- --ovlpi
- Include informative variants from overlapping regions