vcf_utilities.py: VCF Utilities

Automates various utilities for VCF-formatted files. The current utilities are: obtaining a list of the chromosomes within a VCF-based file, obtaining a list of the samples within a VCF-based file, concatenating multiple VCF-based files, merging multiple VCF-based files, and sorting a VCF-based file.

Command-line Usage

The VCF utilities function may be called using the following command:

python vcf_utilities.py

Example usage

Concatenate multiple VCF files:

python vcf_utilities.py --vcfs chr21.vcf.gz chr22.vcf.gz --utility concatenate

Merge multiple VCF files:

python vcf_utilities.py --vcfs chr22.ceu.vcf.gz chr22.yri.vcf.gz --utility merge
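
The example commands above can also be assembled from Python, for instance when the script is called from a larger pipeline. The following is a minimal sketch, not part of vcf_utilities.py: the helper name build_vcf_utilities_command and its script parameter are assumptions made for illustration, and only the --vcfs and --utility arguments documented here are used.

```python
import sys


def build_vcf_utilities_command(vcfs, utility, script="vcf_utilities.py"):
    """Hypothetical helper: return the argument list for one call to the script."""
    if not vcfs:
        raise ValueError("at least one VCF filename is required")
    # Mirrors: python vcf_utilities.py --vcfs <files...> --utility <utility>
    return [sys.executable, script, "--vcfs", *vcfs, "--utility", utility]


cmd = build_vcf_utilities_command(["chr21.vcf.gz", "chr22.vcf.gz"], "concatenate")
# To actually run the command: subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with filenames that contain spaces.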

Dependencies

Input Command-line Arguments

--vcf <input_filename>
Argument used to define the filename of the VCF file.
--vcfs <input1_filename, input2_filename, etc.>
Argument used to define the filename of the VCF file(s). May be used multiple times.

Output Command-line Arguments

--out <output_filename>
Argument used to define the complete output filename, overrides --out-prefix.
--out-prefix <output_prefix>
Argument used to define the output prefix (i.e. filename without file extension).
--overwrite
Argument used to define if previous output should be overwritten.
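
The precedence between the output arguments can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the helper name resolve_output_filename, the default prefix "out", and the choice to raise an error when the output exists and --overwrite is absent are all hypothetical. Only the rule that --out overrides --out-prefix comes from the descriptions above.

```python
import os


def resolve_output_filename(out=None, out_prefix="out", extension="vcf.gz",
                            overwrite=False):
    """Hypothetical helper: pick the output filename from --out / --out-prefix."""
    # --out gives the complete filename and overrides --out-prefix.
    filename = out if out is not None else f"{out_prefix}.{extension}"
    # Assumed behavior: refuse to replace an existing file unless --overwrite is set.
    if os.path.exists(filename) and not overwrite:
        raise FileExistsError(f"{filename} exists; pass --overwrite to replace it")
    return filename
```

For example, resolve_output_filename(out_prefix="chr22.merged", extension="bcf") yields "chr22.merged.bcf" when no such file exists, while passing out="final.vcf" ignores the prefix entirely.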

Utility Command-line Specification

--utility <sample-list, chr-list, concatenate, merge, sort>
Argument used to define the desired utility. Current utilities include: creating a file of the samples within the VCF (sample-list); creating a file of the chromosomes within the VCF (chr-list); combining multiple VCF files with different variants but the same samples (concatenate); combining multiple VCF files with different samples but the same variants (merge); or sorting a single VCF file (sort).
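
The difference between concatenate and merge is easiest to see on a toy example. The sketch below does not read real VCF files and is not how the script is implemented: each "file" is a made-up dictionary holding a sample list and a mapping from (chromosome, position) to one genotype per sample, and the function names are chosen only for illustration.

```python
def concatenate(vcf_a, vcf_b):
    """Same samples, different variants: stack the records."""
    if vcf_a["samples"] != vcf_b["samples"]:
        raise ValueError("concatenate expects identical sample lists")
    return {"samples": list(vcf_a["samples"]),
            "records": {**vcf_a["records"], **vcf_b["records"]}}


def merge(vcf_a, vcf_b):
    """Same variants, different samples: join the genotype columns."""
    if vcf_a["records"].keys() != vcf_b["records"].keys():
        raise ValueError("this toy merge expects identical variant sites")
    return {"samples": vcf_a["samples"] + vcf_b["samples"],
            "records": {site: vcf_a["records"][site] + vcf_b["records"][site]
                        for site in vcf_a["records"]}}


# Made-up data: two chromosomes for one sample, then one site for two samples.
chr21 = {"samples": ["sampleA"], "records": {("21", 100): ["0/1"]}}
chr22 = {"samples": ["sampleA"], "records": {("22", 200): ["1/1"]}}
pop1 = {"samples": ["sampleA"], "records": {("22", 200): ["1/1"]}}
pop2 = {"samples": ["sampleB"], "records": {("22", 200): ["0/0"]}}

combined_sites = concatenate(chr21, chr22)   # 1 sample, 2 sites
combined_samples = merge(pop1, pop2)         # 2 samples, 1 site
```

Concatenation grows the number of records and leaves the sample columns unchanged; merging grows the number of sample columns and leaves the set of sites unchanged.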

Additional Utility Command-line Arguments

--record-merge-mode <none, snps, indels, both, all, id>
Argument used to define the type of multiallelic records to create. Only usable with the merge utility.
--record-missing-as-ref
Argument used to define that missing records should be converted to the reference allele. Only usable with the merge and concatenate utilities.
--out-format <vcf, vcf.gz, bcf>
Argument used to define the desired output format. Formats include: uncompressed VCF (vcf); compressed VCF (vcf.gz) [default]; and BCF (bcf). Only usable with the merge and concatenate utilities.
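
The "only usable with" restrictions above can be expressed as a small argparse sketch. This is an assumption-laden illustration rather than the script's own parser: the function name parse_utility_args is hypothetical, only the utility-related options are included, and whether the real script rejects a disallowed combination or simply ignores it is not stated in this document. The sketch chooses to reject.

```python
import argparse


def parse_utility_args(argv):
    """Hypothetical parser covering only the utility-related options."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--utility", required=True,
                        choices=["sample-list", "chr-list", "concatenate", "merge", "sort"])
    parser.add_argument("--record-merge-mode",
                        choices=["none", "snps", "indels", "both", "all", "id"])
    parser.add_argument("--record-missing-as-ref", action="store_true")
    parser.add_argument("--out-format", choices=["vcf", "vcf.gz", "bcf"])
    args = parser.parse_args(argv)

    combining = args.utility in ("merge", "concatenate")
    # Assumed policy: reject options that the chosen utility cannot use.
    if args.record_merge_mode is not None and args.utility != "merge":
        parser.error("--record-merge-mode is only usable with the merge utility")
    if args.record_missing_as_ref and not combining:
        parser.error("--record-missing-as-ref requires merge or concatenate")
    if args.out_format is not None and not combining:
        parser.error("--out-format requires merge or concatenate")
    # Compressed VCF is the documented default for the combining utilities.
    if args.out_format is None and combining:
        args.out_format = "vcf.gz"
    return args


args = parse_utility_args(["--utility", "merge", "--record-merge-mode", "snps"])
```

Leaving --out-format without an argparse default makes it possible to tell an explicit user choice apart from the documented default, which is what allows the restriction check to work.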