Example Jupyter Pipleine

import sys
import os
import subprocess

from pgpipe import four_gamete, vcf_split_pysam, vcf_to_ima, vcf_filter, vcf_calc, vcf_sampler, vcf_phase, stat_sampler, vcf_split
from pgpipe.logging_module import initLogger
from pgpipe.informative_loci_filter import filter_bed_regions
from pgpipe.subtract_bed import filter_stat
import pysam

print ("Imports complete")
Imports complete

Setting Filepaths

The required input files for a PPP run are: - A genome VCF of the target populations (plus a tabix index if bgzipped) - A population model file

The population model file is a JSON-formatted file that defines population names and the individuals from the VCF that belong to each population. This file can be created using the model_creator function, or by creating it manually by using an example model file as a template.

For region filtering, the following should be provided: - Name for target region file - Name for final selected region file - File with genic regions/regions to be excluded from analysis (optional) - File with regions to be selected from for analysis (optional)

There are two methods for obtaining target regions for subsampling and analysis, with some level of interoperability: using a statistic file to sample regions, or use a file with target regions to randomly select regions with enough sites to be valid in an IM analysis. To generate a statistic file, use vcf_calc to read over the input VCF file, then use stat_sampler to select either a random or uniform distribution of these regions given their statistic value. These stat files can be filtered with a genic region file, available from UCSC, using the subtract_bed function. This method is implemented in this notebook, with the genic region filtering offered as an optional cell.

An additional method of region selection, done without statistics, can be done by downloading region files for genic regions, and optionally for STR regions and regions with missing data. If one wants to find a set of target regions that are intergenic and outside of STR regions, download the corresponding files from UCSC for your species and use invert_bed_region to ‘invert’ the files, optionally selecting only regions outside of a set number of bases from the regions in the file with the –window option. If files don’t have typical BED column order, use the –bed-column-index option to provide a comma-separated string with the 0-based index of the start, end, and chromosome column. (For normal BED files, this would be ‘1,2,0’) The get_nonmissing_chunks function can scan the input VCF and find regions with no missing data. All of these can be combined with bedtools to select regions that overlap with all three possible files.

Whichever method used, the target region file should be run through the informative_filter function to check the VCF file has enough biallelic, informative SNPs (two or more of each of two alleles) to have a good chance of passing the four-gamete test (which requires regions with at least two SNPs). An informative count of 5 will usually allow for this, but if you have limited regions this threshhold can be lowered to 3.

In addition to these files, additional functionality such as CpG filtering and comprehensive logging can be used by providing: - A reference FASTA (for CpG filtering, can be bgzipped or unzipped but requires indexing w/faidx) - A log filename

This section also is used to set up the directory structure for data files, a working directory, a directory for VCF region files, a target number of loci, and names for various stages of loci VCF files (phasing, four-gamete testing, potentially filtering individuals with missing data).

#Set up directories and filepaths, run on all restarts
vcf_dir = work_dir+'vcfs/'

main_vcf_name = data_dir+'pan_chr20.vcf.gz'
filtered_vcf_name = data_dir+'pan_chr20_filtered.vcf.gz'
stat_file_name = work_dir+'fst_regions.bed'
model_file = data_dir+'great_ape.model'
int_bed_file = work_dir+'regions_for_sampling.bed'
target_loci_file = work_dir+'target_loci.bed'
ima_input_file = work_dir+'test_run_input.ima.u'
#subsamp_bed_file = work_dir+'great_ape_genome2/5k_sample.bed'
logfile = '/home/jared/testpppj.log'


region_files = [vcf_dir+'Sampled_nonmissing/Sample_'+str(i)+'.vcf' for i in range(loci)]
phased_files = [vcf_dir+'Phased/phased_'+str(i)+'.vcf' for i in range(loci)]
fourg_files = [vcf_dir+'four_gamete/Sample_'+str(i)+'.vcf' for i in range(loci)]
passed_files = []

#Set up directory structure, only needs to be run once
if not os.path.exists(vcf_dir):

The vcf_filter step will filter the original VCF according to many conditions, including: - Non-biallelic sites (–filter-min-alleles and –filter-max-alleles) - Sites with missing data (–filter-max-missing) - Indels - Sites on non-autosomal chromosomes (–filter-exclude-chr) - Individuals not named in model file (use –model-file to input model file, with –model if multiple models included in file)

#Creates VCF filtered for no missing data and biallelic sites
vcf_filter.run(['--vcf', main_vcf_name, '--filter-max-missing', '1.0', '--model-file',model_file,
                '--model','2Pop', '--filter-min-alleles', '2', '--filter-max-alleles', '2', '--out-format',
                'vcf.gz', '--out', filtered_vcf_name, '--filter-exclude-chr', 'chrX', 'chrY', '--overwrite'])

print("Filtering complete")
Filtering complete

The vcf_calc step will, for every 10kb window in the genome, calculate Fst given populations from the model file. Statistics that can be filtered over currently include: - Windowed pi - Tajima’s D - Fst

#Calculates f_st statistics across genome
vcf_calc.run(['--vcf', filtered_vcf_name, '--out', stat_file_name,
              '--calc-statistic', 'windowed-weir-fst', '--model', '2Pop', '--statistic-window-size',
              '10000', '--statistic-window-step', '10000', '--model-file', model_file, '--overwrite'])
print("Stat calculation complete")
Stat calculation complete

Using the informative site filter, the regions produced by the statistics generation can be checked for whether they contain enough informative sites to pass the four-gamete test. Additional filtering can be done here if it hasn’t been done before.

#Selects subset of regions for fast sampling
print("BED regions selected")

BED regions selected
Only 2848 of 5000 regions found

(Optional) If a file with genic regions is provided, statistic windows that overlap those regions can be removed from potential loci. The window option can be used to extend exclusion regions by a set number of base pairs up AND downstream of regions in the filter file. Zero-ho indicates that files use a zero-based, half open interval representation, as opposed to the general 1-based, closed region format.

int_bed_file2 = work_dir+'regions_for_sampling_nogenes.bed'
gene_file = work_dir+'hg18_chr22_genes.bed'
int_bed_file = int_bed_file2
WARNING:root:1773 of 2848 regions selected as non-overlapping

The statistic file can be filtered in one of two ways: randomly or uniformly. A random sample (which conceptually doesn’t require a statistic) will select sample-size numer of loci for analysis, while a uniform sample will attempt to create a uniform distribution of the chosen statistic. The samples will be placed into a set number of bins, which must be divisible by the number of target loci.


This function creates loci VCF files from the full-genome VCF, while optionally doing additional filtering

#Uniformly sample regions for subset of 200 loci
#vcf_sampler.run(['--vcf', filtered_vcf_name, '--statistic-file',
#                 target_loci_file, '--out-format', 'vcf', '--calc-statistic', 'windowed-weir-fst',
#                 '--sampling-scheme', 'uniform', '--uniform-bins', '5', '--out-dir',
#                 work_dir + 'great_ape_genome2/Sample_Files', '--overwrite'])
#              vcf_dir+'Sampled_nonmissing/Sample','--split-file',target_loci_file])

print("Sampling complete")

Sampling complete

Each locus must be prepared for IM analysis, which involves finding a subregion of each locus that passes the four-gamete test. The four-gamete test is passed if all pairs of alleles in a region have less than four gametes among them. For example, if two SNPs are A/C and G/T, there are four possible gamete haplotypes among them: AG, AT, CG, and CT. If haplotypes with all four of these are present, this indicates that there must have been a recombination event between them at some point in the sampled populations. This violates the IM model, so these regions would fail the test. The four-gamete code as implemented will compute all regions that pass the four-gamete test in a locus VCF file, then select a region either at random or with the largest number of informative sites. A minimum number of informative sites can be set, which defaults to two.

Before the four-gamete test, the locus VCF files need to be phased. This pipeline provides two phasing programs, beagle and shapeit.

#Phase locus
for i in range(loci):
print ("Phasing done")
Phasing done

Once phasing is done, each file must be filtered through the four-gamete test. The four-gamete test is a method for determining whether or not there has been recombination between a pair of variants. To do this, all individuals have haplotypes defined as the variants at the two sites. Given two snps with ref/alt alleles A/G and C/T, if individuals in this sample have haplotypes AC, AT, and GT, it is possible that there has been no recombination between these alleles. If an additional individual has the GC haplotype, this means that a recombination event must have taken place between the sites. This function will return a subregion of the region in the contained VCF that passes the four-gamete test with at least two informative (ac>1) SNPs. If no valid region is found, no VCF is created and the region is skipped for downstream analysis.

#Subsample locus for four-gamete compatible interval, if no subregion returned, do not use VCF
passed_files = []
for i in range(loci):
    ret = four_gamete.sample_fourgametetest_intervals(['--vcfs', phased_files[i], '--out',
                                                       fourg_files[i], '--4gcompat', '--reti', '--right',
                                                       '--numinf', '2'])
    if ret[0] is not None:
print ("Four gamete regions selected for %d loci"%(len(passed_files)))
Four gamete regions selected for 199 loci

This converts the files that pass the four-gamete test into a single IMa input file. Required arguments are a list of VCF files, and a model file. Filtering options are also available if unwanted sites haven’t been filtered out at a previous step.

#Create IMa input file
ima_args = ['--vcfs']
ima_args.extend(['--model-file', model_file, '--model','2Pop','--out', work_dir + 'ima_all_loci.ima.u'])

print ("IMa input created")
IMa input created

If desired, this block will run admixture to determine the population assignments of the various populations in the input VCF files. The plot will indicate, for each individual, an estimate of how much of their ancestry comes from the populations determined by clustering in admixture.

[ ]:
#Admixture analysis, optional
from pgpipe import convert, admixture, graph_plotter
phased_string = ' '.join(phased_files)
loci_vcf = vcf_dir+'Phased/phased_merged.vcf.gz'
concatcall = subprocess.Popen('vcf-concat '+phased_string+ ' | bgzip -c > '+loci_vcf, shell=True,stdout=subprocess.PIPE)
temp_out, temp_err = concatcall.communicate()
print ("Plots created")
[ ]: