vcf_phase.py: VCF Phase Function
Phasing is an essential and frequently used process in population genetic analyses. Given an unphased VCF file and a selected phasing algorithm, vcf_phase will produce a phased VCF. Phasing may be configured using various general options (e.g. specifying Ne, including a genetic map) or algorithm-specific options (e.g. including a compatible reference panel) as needed.
In this illustration of the phasing process, unphased variants (alleles divided diagonally) are converted into estimated haplotypes (alleles divided horizontally and on separate strands).
Command-line Usage
The VCF file phaser may be called using the following command:
vcf_phase.py
Example usage
Command-line to phase a VCF using Beagle:
vcf_phase.py --vcf examples/files/merged_chr1_10000.unphased.vcf.gz --phase-algorithm beagle
Command-line to phase a VCF using SHAPEIT:
vcf_phase.py --vcf examples/files/merged_chr1_10000.unphased.vcf.gz --phase-algorithm shapeit
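For scripted workflows, the same calls can be assembled in Python. The following is a minimal sketch, assuming vcf_phase.py is available on the PATH. The helper name build_phase_command is hypothetical and is not part of vcf_phase.py; it only uses the --vcf and --phase-algorithm arguments shown in the examples above.

```python
import shlex


def build_phase_command(vcf_path, algorithm="beagle"):
    """Return the argument list for a vcf_phase.py call (hypothetical helper)."""
    # Only the two algorithms named in this documentation are accepted.
    if algorithm not in ("beagle", "shapeit"):
        raise ValueError(f"unsupported phasing algorithm: {algorithm}")
    return ["vcf_phase.py", "--vcf", vcf_path, "--phase-algorithm", algorithm]


cmd = build_phase_command(
    "examples/files/merged_chr1_10000.unphased.vcf.gz", "shapeit"
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list rather than a single string avoids shell quoting problems when filenames contain spaces.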
Dependencies
Input Command-line Arguments
- --vcf <input_filename>
- Argument used to define the filename of the VCF file to be phased.
- --model-file <model_filename>
- Argument used to define the model file. Please note that this argument cannot be used with the --pop-file argument or individual-based filters.
- --model <model_str>
- Argument used to define the model (i.e. the individual(s) to include and/or the populations for relevant statistics). May be used with any statistic. Please note that this argument cannot be used with the --pop-file argument or the individual-based filters.
Output Command-line Arguments
- --out <output_filename>
- Argument used to define the complete output filename. Overrides --out-prefix.
- --out-prefix <output_prefix>
- Argument used to define the output prefix (i.e. filename without file extension).
- --out-format <vcf, vcf.gz, bcf>
- Argument used to define the desired output format. Formats include: uncompressed VCF (vcf); compressed VCF (vcf.gz) [default]; and BCF (bcf).
- --overwrite
- Argument used to define if previous output should be overwritten.
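The precedence between --out, --out-prefix and --out-format can be sketched as below. This is an illustration under stated assumptions, not the script's actual code: the function name resolve_output_name and the default prefix "out" are hypothetical, and joining prefix and format with a dot is assumed. Only the rule that --out overrides --out-prefix and the vcf.gz default format come from the descriptions above.

```python
def resolve_output_name(out=None, out_prefix="out", out_format="vcf.gz"):
    """Pick the output filename (hypothetical helper, assumed behaviour)."""
    # The three formats listed for --out-format.
    if out_format not in ("vcf", "vcf.gz", "bcf"):
        raise ValueError(f"unsupported output format: {out_format}")
    # A complete filename given with --out takes priority over the prefix.
    if out is not None:
        return out
    # Assumption: prefix and format are joined with a dot.
    return f"{out_prefix}.{out_format}"


print(resolve_output_name(out_prefix="chr1_phased"))
print(resolve_output_name(out="final.bcf", out_prefix="ignored"))
```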
Phasing Command-line Arguments
- --phase-algorithm <beagle, shapeit>
- Argument used to define the phasing algorithm. Options include: BEAGLE 5.0 (beagle) [default] and SHAPEIT (shapeit). Please note: both algorithms possess algorithm-specific arguments that may be found in their respective sections.
- --Ne <Ne_int>
- Argument used to define the effective population size.
- --genetic-map <genetic_map_filename>
- Argument used to define a genetic map file.
- --phase-chr <chr>
- Argument used to define a single chromosome to phase.
- --phase-from-bp
- Argument used to define the lower bound of positions to include. May only be used with a single chromosome.
- --phase-to-bp
- Argument used to define the upper bound of positions to include. May only be used with a single chromosome.
- --random-seed <seed_int>
- Argument used to define the seed value for the random number generator.
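The restriction that the position bounds apply to a single chromosome can be checked before the script is called. The sketch below is a hypothetical helper (phase_region_args is not part of vcf_phase.py). It assumes that "a single chromosome" means --phase-chr must be given, and that --phase-from-bp and --phase-to-bp each take one integer position.

```python
def phase_region_args(chrom=None, from_bp=None, to_bp=None):
    """Build region arguments for vcf_phase.py (hypothetical helper)."""
    # Assumption: position bounds are only meaningful alongside --phase-chr.
    if chrom is None and (from_bp is not None or to_bp is not None):
        raise ValueError("position bounds require a single chromosome")
    # Reject an empty interval before any phasing is attempted.
    if from_bp is not None and to_bp is not None and from_bp > to_bp:
        raise ValueError("lower bound exceeds upper bound")
    args = []
    if chrom is not None:
        args += ["--phase-chr", str(chrom)]
    if from_bp is not None:
        args += ["--phase-from-bp", str(from_bp)]
    if to_bp is not None:
        args += ["--phase-to-bp", str(to_bp)]
    return args


print(phase_region_args("chr1", 1, 10000))
```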
SHAPEIT Phasing Command-line Arguments
- --shapeit-ref <ref_haps> <ref_legend> <ref_sample>
- Argument used to define a reference panel. Three files are required: the reference haplotypes (.haps), the SNP map (.legend), and the individual information (.sample).
- --shapeit-burn-iter <iteration_int>
- Argument used to define the number of burn-in iterations.
- --shapeit-prune-iter <iteration_int>
- Argument used to define the number of pruning iterations.
- --shapeit-main-iter <iteration_int>
- Argument used to define the number of main iterations.
- --shapeit-states <state_int>
- Argument used to define the number of conditioning states for haplotype estimation.
- --shapeit-window <Mb_float>
- Argument used to define the model window size in Mb.
- --shapeit-force
- Argument used to disable the missing data error (i.e. --force). Use at your own risk.
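The SHAPEIT-specific arguments above can be collected programmatically. The following is a minimal sketch with a hypothetical helper (shapeit_args) and hypothetical file names; it only uses flags listed in this section and checks that the reference panel is given as exactly three files.

```python
def shapeit_args(ref_panel=None, burn_iter=None, prune_iter=None,
                 main_iter=None, force=False):
    """Build SHAPEIT-related arguments for vcf_phase.py (hypothetical helper)."""
    args = ["--phase-algorithm", "shapeit"]
    if ref_panel is not None:
        # --shapeit-ref expects haplotypes, legend and sample files, in that order.
        if len(ref_panel) != 3:
            raise ValueError("reference panel needs .haps, .legend and .sample files")
        args += ["--shapeit-ref", *ref_panel]
    # Add each iteration count only when it was explicitly requested.
    for flag, value in (("--shapeit-burn-iter", burn_iter),
                        ("--shapeit-prune-iter", prune_iter),
                        ("--shapeit-main-iter", main_iter)):
        if value is not None:
            args += [flag, str(int(value))]
    if force:
        args.append("--shapeit-force")
    return args


# File names below are placeholders, not files shipped with the examples.
print(shapeit_args(("panel.haps", "panel.legend", "panel.sample"), burn_iter=10))
```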
BEAGLE Phasing Command-line Arguments
- --beagle-ref <ref_vcf, ref_bref3>
- Argument used to define a reference panel VCF or bref3.
- --beagle-burn-iter <iteration_int>
- Argument used to define the number of burn-in iterations.
- --beagle-iter <iteration_int>
- Argument used to define the number of main iterations.
- --beagle-states <state_int>
- Argument used to define the number of model states for genotype estimation.
- --beagle-error <probability>
- Argument used to define the HMM allele mismatch probability.
- --beagle-window <cM_float>
- Argument used to define the sliding window size in cM.
- --beagle-overlap <cM_float>
- Argument used to define the overlap between neighboring windows in cM.
- --beagle-step <cM_float>
- Argument used to define the step length in cM used for identifying short IBS segments.
- --beagle-nsteps <windows_int>
- Argument used to define the number of consecutive steps (each of the length set by --beagle-step) used for identifying long IBS segments.
- --beagle-path <path>
- Argument used to define the path to locate beagle.jar.
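To illustrate how a sliding window size and an overlap between neighboring windows relate, the sketch below tiles a chromosome of a given genetic length with overlapping windows. It is a conceptual illustration only, not BEAGLE's internal implementation; the function name sliding_windows and the example values are assumptions chosen for clarity.

```python
def sliding_windows(total_cm, window_cm, overlap_cm):
    """Return (start, end) windows in cM that cover 0..total_cm (illustration)."""
    if total_cm <= 0 or window_cm <= 0:
        raise ValueError("lengths must be positive")
    # The overlap must be smaller than the window, or windows never advance.
    if not 0 <= overlap_cm < window_cm:
        raise ValueError("overlap must be non-negative and smaller than the window")
    windows = []
    start = 0.0
    while True:
        end = min(start + window_cm, total_cm)
        windows.append((start, end))
        if end >= total_cm:
            break
        # The next window begins overlap_cm before the current one ends.
        start = end - overlap_cm
    return windows


for window in sliding_windows(total_cm=100, window_cm=40, overlap_cm=4):
    print(window)
```

Each window after the first shares overlap_cm with its predecessor, so the effective advance per window is the window size minus the overlap.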