vcf_phase.py: VCF Phase Function

Phasing is an essential and frequently used step in population genetic analyses. Given an unphased VCF file and a selected phasing algorithm, vcf_phase will produce a phased VCF. Phasing may be configured using general options (e.g. specifying Ne, including a genetic map) or algorithm-specific options (e.g. including a compatible reference panel) as needed.

../../_images/PPP_Phase.png

In this illustration of the phasing process, unphased variants (alleles divided diagonally) are converted into estimated haplotypes (alleles divided horizontally and on separate strands).

Command-line Usage

The VCF file phaser may be called using the following command:

vcf_phase.py

Example usage

Command-line to phase a VCF using Beagle:

vcf_phase.py --vcf examples/files/merged_chr1_10000.unphased.vcf.gz --phase-algorithm beagle

Command-line to phase a VCF using SHAPEIT:

vcf_phase.py --vcf examples/files/merged_chr1_10000.unphased.vcf.gz --phase-algorithm shapeit
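
The same calls can be scripted. The sketch below assembles the Beagle example above as an argument list suitable for Python's subprocess module. It assumes vcf_phase.py is on the PATH and executable; build_phase_command and run_phase are hypothetical helpers for illustration and are not part of vcf_phase.py.

```python
import subprocess


def build_phase_command(vcf_path, algorithm="beagle", extra_args=()):
    """Assemble a vcf_phase.py call as an argument list (hypothetical helper)."""
    # Only flags documented on this page are used here.
    return ["vcf_phase.py", "--vcf", vcf_path,
            "--phase-algorithm", algorithm, *extra_args]


def run_phase(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    # Assumption: vcf_phase.py is installed and reachable on PATH.
    return subprocess.run(cmd, check=True)


cmd = build_phase_command("examples/files/merged_chr1_10000.unphased.vcf.gz")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when filenames contain spaces. Call run_phase(cmd) to actually start the phasing job.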

Input Command-line Arguments

--vcf <input_filename>
Argument used to define the filename of the VCF file to be phased.
--model-file <model_filename>
Argument used to define the model file. Please note that this argument cannot be used with the --pop-file argument or individual-based filters.
--model <model_str>
Argument used to define the model (i.e. the individual(s) to include and/or the populations for relevant statistics). May be used with any statistic. Please note that this argument cannot be used with the --pop-file argument or individual-based filters.

Output Command-line Arguments

--out <output_filename>
Argument used to define the complete output filename. Overrides --out-prefix.
--out-prefix <output_prefix>
Argument used to define the output prefix (i.e. filename without file extension).
--out-format <vcf, vcf.gz, bcf>
Argument used to define the desired output format. Formats include: uncompressed VCF (vcf); compressed VCF (vcf.gz) [default]; and BCF (bcf).
--overwrite
Argument used to define whether previous output should be overwritten.
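
The precedence between the output arguments can be sketched as follows. This is an illustration of the rules stated above, not the code used by vcf_phase.py: resolve_output_filename is a hypothetical helper, and joining the prefix and format with a dot is an assumption. The default prefix is not stated on this page, so the sketch requires one explicitly.

```python
VALID_FORMATS = ("vcf", "vcf.gz", "bcf")


def resolve_output_filename(out=None, out_prefix=None, out_format="vcf.gz"):
    """Pick an output filename from --out, --out-prefix and --out-format values."""
    if out_format not in VALID_FORMATS:
        raise ValueError(f"unknown output format: {out_format}")
    # --out is the complete filename and overrides --out-prefix.
    if out is not None:
        return out
    if out_prefix is None:
        raise ValueError("this sketch needs either out or out_prefix")
    # Assumption: the format string doubles as the file extension.
    return f"{out_prefix}.{out_format}"


print(resolve_output_filename(out_prefix="phased_chr1"))
print(resolve_output_filename(out="final.bcf", out_prefix="ignored"))
```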

Phasing Command-line Arguments

--phase-algorithm <beagle, shapeit>
Argument used to define the phasing algorithm. Options are: BEAGLE 5.0 (beagle) [default] and SHAPEIT (shapeit). Please note: both algorithms have algorithm-specific arguments, which are listed in their respective sections.
--Ne <Ne_int>
Argument used to define the effective population size.
--genetic-map <genetic_map_filename>
Argument used to define a genetic map file.
--phase-chr <chr>
Argument used to define a single chromosome to phase.
--phase-from-bp
Argument used to define the lower bound of positions to include. May only be used with a single chromosome.
--phase-to-bp
Argument used to define the upper bound of positions to include. May only be used with a single chromosome.
--random-seed <seed_int>
Argument used to define the seed value for the random number generator.
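
Because the position bounds may only be used with a single chromosome, a wrapper script can check that constraint before calling vcf_phase.py. The sketch below is a hypothetical pre-flight check, not part of the tool. The extra check that the lower bound does not exceed the upper bound is an assumption about sensible input rather than documented behaviour.

```python
def region_args(chrom=None, from_bp=None, to_bp=None):
    """Build --phase-chr / --phase-from-bp / --phase-to-bp arguments (hypothetical helper)."""
    # Documented rule: position bounds require a single chromosome.
    if chrom is None and (from_bp is not None or to_bp is not None):
        raise ValueError("position bounds require --phase-chr")
    # Assumption: a lower bound above the upper bound is a user error.
    if from_bp is not None and to_bp is not None and from_bp > to_bp:
        raise ValueError("--phase-from-bp must not exceed --phase-to-bp")
    args = []
    if chrom is not None:
        args += ["--phase-chr", str(chrom)]
    if from_bp is not None:
        args += ["--phase-from-bp", str(from_bp)]
    if to_bp is not None:
        args += ["--phase-to-bp", str(to_bp)]
    return args


print(region_args("chr1", 1, 10000))
```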

SHAPEIT Phasing Command-line Arguments

--shapeit-ref <ref_haps> <ref_legend> <ref_sample>
Argument used to define a reference panel. Three files are required: the reference haplotypes (.haps), the SNP map (.legend), and the individual information (.sample).
--shapeit-burn-iter <iteration_int>
Argument used to define the number of burn-in iterations.
--shapeit-prune-iter <iteration_int>
Argument used to define the number of pruning iterations.
--shapeit-main-iter <iteration_int>
Argument used to define the number of main iterations.
--shapeit-states <state_int>
Argument used to define the number of conditioning states for haplotype estimation.
--shapeit-window <Mb_float>
Argument used to define the model window size in Mb.
--shapeit-force
Argument used to disable the missing data error (i.e. --force). Use at your own risk.

BEAGLE Phasing Command-line Arguments

--beagle-ref <ref_vcf, ref_bref3>
Argument used to define a reference panel VCF or bref3.
--beagle-burn-iter <iteration_int>
Argument used to define the number of burn-in iterations.
--beagle-iter <iteration_int>
Argument used to define the number of main iterations.
--beagle-states <state_int>
Argument used to define the number of model states for genotype estimation.
--beagle-error <probability>
Argument used to define the HMM allele mismatch probability.
--beagle-window <cM_float>
Argument used to define the sliding window size in cM.
--beagle-overlap <cM_float>
Argument used to define the overlap between neighboring windows in cM.
--beagle-step <cM_float>
Argument used to define the step length in cM used for identifying short IBS segments.
--beagle-nsteps <windows_int>
Argument used to define the number of consecutive steps (each of length --beagle-step) used for identifying long IBS segments.
--beagle-path <path>
Argument used to define the path to locate beagle.jar.
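
To see how --beagle-window and --beagle-overlap relate, the sketch below lays out overlapping windows along a genetic-map interval. It is a geometric illustration only and does not reproduce Beagle's internal windowing. The function name sliding_windows and the example values are hypothetical, and the values are not Beagle defaults.

```python
def sliding_windows(start_cm, end_cm, window_cm, overlap_cm):
    """Return (left, right) window bounds in cM, with neighbours sharing overlap_cm."""
    if start_cm >= end_cm:
        raise ValueError("start_cm must be below end_cm")
    if window_cm <= 0:
        raise ValueError("window_cm must be positive")
    # The overlap must be smaller than the window, or the windows never advance.
    if not 0 <= overlap_cm < window_cm:
        raise ValueError("overlap_cm must satisfy 0 <= overlap_cm < window_cm")
    windows = []
    left = start_cm
    while True:
        right = min(left + window_cm, end_cm)
        windows.append((left, right))
        if right >= end_cm:
            break
        # The next window starts overlap_cm before the end of this one.
        left = right - overlap_cm
    return windows


print(sliding_windows(0.0, 100.0, 40.0, 4.0))
```

With these illustrative values, each window shares 4 cM with its neighbour, and the final window is truncated at the end of the interval.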