vcf_to_fastsimcoal.py: VCF to fastsimcoal Conversion Function

Generates site frequency spectrum (SFS) files for fastsimcoal, following the instructions in the fastsimcoal version 2.6 manual.

Generates one-dimensional (1D), two-dimensional (2D), and multidimensional SFS files.

All generated SFS files are contained in a zip file archive.

Excoffier, L. and M. Foll. 2011. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27: 1332-1334.

Required Arguments

--vcf <input_vcf_filename>

The name of the vcf file. This can be a bgzipped vcf file.

--model-file <model_file_name>

The name of a PPP model file.

--modelname <model_name>

The name of a model in the model file. The SFS files are generated for the populations defined in this model.

--dim <dimension file type signifiers>

One or more of '1', '2', or 'm', for 1D, 2D or multidimensional output files.

For 1D files:

the filename suffix is _DAFpop#.obs for an array of derived allele counts, where '#' is replaced by the population number
the filename suffix is _MAFpop#.obs for an array of minor allele counts.

For 2D files:

the filename suffix is _jointDAFpop#_&.obs for an array of derived allele counts, where # and & are population numbers and # is larger than &
the filename suffix is _jointMAFpop#_&.obs for an array of minor allele counts.

For a multidimensional file:

the filename suffix is _DSFS.obs for an array of derived allele counts.
the filename suffix is _MSFS.obs for an array of minor allele counts.
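
The suffix rules above can be sketched as a small helper that lists the file names to expect in the archive. This is an illustration only, not code from the script: the function name `expected_sfs_filenames` is hypothetical, population numbers are assumed to start at 0 (the fastsimcoal convention), the folded 2D suffix is assumed to be `_jointMAFpop#_&.obs`, and the basename is assumed to be joined directly to the suffix.

```python
from itertools import combinations


def expected_sfs_filenames(basename, n_pops, dims, folded=False):
    """List the .obs file names implied by the suffix rules (hypothetical helper)."""
    kind = "MAF" if folded else "DAF"
    names = []
    if "1" in dims:
        # One 1D file per population; 0-based numbering is an assumption.
        names += [f"{basename}_{kind}pop{i}.obs" for i in range(n_pops)]
    if "2" in dims:
        # combinations() yields (lo, hi) with lo < hi; the larger number comes first.
        names += [
            f"{basename}_joint{kind}pop{hi}_{lo}.obs"
            for lo, hi in combinations(range(n_pops), 2)
        ]
    if "m" in dims:
        names.append(f"{basename}_{'MSFS' if folded else 'DSFS'}.obs")
    return names
```

For example, calling it with three populations and `dims=["1", "2", "m"]` gives three 1D names, three 2D names (one per population pair), and one multidimensional name.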

Optional Arguments

--basename <name of output file prefix>

This is used to specify the prefix of the output files and the prefix of the zip file archive. The default is "ppp_fsc" in the same folder as the vcf file.

--bed-file <BED_file_name>

The BED file is a sorted UCSC-style BED file containing the chromosome locations of the SNPs to be included in the output files. The BED file has no header. The first column is the chromosome name (this must match the chromosome name in the vcf file). The second column is the start position (0-based, open interval). The third column is the end position (closed interval). Any other columns are ignored.
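
As a rough sketch of how such intervals select SNPs, the following reads the first three BED columns and tests a 1-based vcf position against them. It assumes the usual reading of the coordinates described above, in which a vcf position `pos` is inside an interval when `start < pos <= end`. The function names are hypothetical, and the script's own filtering may be implemented differently.

```python
def load_bed_intervals(lines):
    """Parse BED lines into {chrom: [(start, end), ...]}, ignoring extra columns."""
    intervals = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals.setdefault(chrom, []).append((start, end))
    return intervals


def snp_in_bed(intervals, chrom, pos):
    """True if a 1-based vcf position satisfies start < pos <= end for any interval."""
    return any(start < pos <= end for start, end in intervals.get(chrom, []))
```

With an interval `chr22 100 200`, position 101 is the first included site and position 200 the last; positions 100 and 201 fall outside.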

--outgroup-fasta <name of alternative reference sequence>

This option is used to specify the name of a fasta file to use as an alternative reference to that used for the vcf file.

This fasta file must have been properly aligned to the reference used in the vcf file.

This option can be useful, for example, if an ancestral or outgroup reference is available that more accurately identifies the ancestral (and thus derived) allele at each SNP than does the reference used to make the vcf file.
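
The idea of using an outgroup base to decide which allele is derived can be illustrated with a minimal sketch. The function name `derived_allele` is hypothetical, and returning `None` for sites that cannot be polarized is an assumption; the documentation does not say how the script treats such sites.

```python
def derived_allele(ref, alt, ancestral):
    """Return the derived allele of a biallelic SNP, or None if it cannot be polarized."""
    ref, alt, ancestral = ref.upper(), alt.upper(), ancestral.upper()
    if ancestral == ref:
        return alt  # reference allele is ancestral, so the alternate is derived
    if ancestral == alt:
        return ref  # alternate allele is ancestral, so the reference is derived
    return None  # outgroup base matches neither allele (for example N or a gap)
```

When the outgroup base equals the alternate allele, the derived allele count at that SNP is the count of the reference allele rather than the alternate allele.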

--downsamplesizes <down sample sizes>

A sequence of integers, one for each of the populations in the model in the same order as populations listed in the model. The values specify the down sampling to be used for each respective population. For a population with k>=1 diploid individuals (2k>=2 genomes) in the model, the downsample count d must be 2<=d<=2k.
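
The constraint 2<=d<=2k can be checked before running the script. The sketch below is a hypothetical helper, not part of the tool; it assumes the caller already knows the number of diploid individuals in each population, listed in model order.

```python
def check_downsample_sizes(diploid_counts, downsample_sizes):
    """Raise ValueError unless each d satisfies 2 <= d <= 2k for its population."""
    if len(diploid_counts) != len(downsample_sizes):
        raise ValueError("need exactly one downsample size per population")
    for index, (k, d) in enumerate(zip(diploid_counts, downsample_sizes)):
        if not 2 <= d <= 2 * k:
            raise ValueError(
                f"population {index}: d={d} is outside 2..{2 * k} for k={k} individuals"
            )
    return True
```

For instance, a population with k=1 individual only admits d=2, while a population with k=2 individuals admits d=2, 3, or 4.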

--folded <True/False>

The folded option indicates that the folded sfs should be returned. If folded is False (default), the sfs reports the count of the derived allele. If True, the sfs reports the count of the minor (less frequent) allele.
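
The relationship between the two spectra can be sketched for the 1D case: in a sample of n genomes, a site with i copies of one allele has n-i copies of the other, so classes i and n-i are merged. The helper below is an illustration with a hypothetical name. It returns only the n//2 + 1 minor allele classes, whereas the layout of the files written by the script may differ (for example, by padding the upper classes with zeros).

```python
def fold_sfs(unfolded):
    """Fold a 1D derived-allele SFS (entries for counts 0..n) into minor-allele classes."""
    n = len(unfolded) - 1  # number of sampled genomes
    folded = []
    for i in range(n // 2 + 1):
        j = n - i  # complementary class with the same minor allele count
        # When n is even, the middle class pairs with itself and is counted once.
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded
```

The total number of sites is unchanged by folding; only the distinction between the derived and ancestral allele is dropped.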

--randomsnpprop <floating point value between 0 and 1>

This option can be used to randomly sample a subset of SNPs. The default is to sample all biallelic SNPs.

--seed <integer>

This is used with --randomsnpprop as the seed for the random number generator.
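
The effect of combining a sampling proportion with a seed can be sketched as follows. This is not the script's own sampling code: the function name is hypothetical, and keeping each SNP independently with probability equal to the proportion is an assumed scheme.

```python
import random


def sample_snps(snps, proportion, seed=None):
    """Keep each SNP independently with probability `proportion` (assumed scheme)."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)  # private generator, so the seed alone fixes the draw
    return [snp for snp in snps if rng.random() < proportion]
```

Repeating a call with the same seed returns the same subset, which is what makes a subsampled run reproducible.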

Example usage

Example command-lines:

vcf_to_fastsimcoal.py -h

vcf_to_fastsimcoal.py --vcf pan_example2.vcf.gz --model-file panmodels.model --modelname 5Pop --downsamplesizes 3 3 3 4 2  --basename vcf_fsc2 --folded --dim 1 2 m  --outgroup-fasta chr22_pan_example2_ref.fa