vcf_to_sfs.py: Site Frequency Spectrum generator¶
For generating the site frequency spectrum (sfs)for a population model from a vcf file.
The sfs is an array with as many dimensions as populations in the model. For example, if population samples are in order A,B, C then position (i,j,k) of the array refers to the count of SNPs with derived alleles that were observed to have a count of i in A, j in B, and k in C
If the sfs is folded then the count in a cell of the sfs is the number of SNPs with that combination of minor allele counts.
This script can be run in stand-alone mode, or for more flexibility it can be imported to give access to more general functions for building and manipulating the sfs.
All vcf handling assumes that all individuals are diploid at all SNPs.
- --vcf <input_vcf_filename>
- The name of the vcf file. This can be a bgzipped vcf file. .
- --model-file <model_file_name>
- The name of a PPP model file.
- --modelname <model_name>
- The name of a model in the model file. The treemix file to be generated will contain the allele counts for each SNP in each of the populations. The treemix run will estimate the phylogeny for the populations in the model.
- --out <out file name>
- The name of an output file. If --out is omitted the default is ppp_sfs.out in the same folder as the vcffile This will be a tab-delimited file If the number of dimensions is 2, the sfs is contained in the rows and columns, otherwise the values are given on the first line of the file
- --bed-file <BED_file_name>
- The BED file is a sorted UCSC-style bedfile containing chromosome locations of the SNPs to be included in the output files. The BED file has no header. The first column is the chromosome name (this must match the chromosome name in the vcf file). The second column is start position (0-based, open interval) The third column is end position (closed interval). Any other columns are ignored.
- --outgroup_fasta <name of alternative reference sequence>
This option is used to specify the name of a fasta file to use as an alternative reference to that originally used for the vcf file.
This fasta file must have been properly aligned to the reference used in the vcf file.
This option can be useful, for example, if an ancestral or outgroup reference is available that more accurately identifies the ancestral (and thus derived) allele at each SNP than does the reference used to make the vcf file.
- --downsamplesizes <down sample sizes>
- A sequence of integers, one for each of the populations in the model in the same order as populations listed in the model. The values specify the down sampling to be used for each respective population. For a population with k>=1 diploid individuals (2k>=2 genomes) in the model, the downsample count d must be 2<=d<=2k.
- --folded <True/False>
- The folded option indicates that the folded sfs should be returned. If folded is False (default) the sfs reports the count of the derived allele. If True, the sfs reports of the count of the minor (less frequent) allele.
- --randomsnpprop <floating point value between 0 and 1>
- This option can be used to randomly sample a subset of SNPs. The default is to sample all biallelic SNPs.
- --seed <integer>
- This is used with --randomsnpprop as the seed for the random number generator.
- --makeint <True/False>
- If True, round the counts in the sfs to the nearest integer (default False)
vcf_to_sfs.py --vcf pan_example2.vcf.gz --model-file panmodels.model --modelname 4Pop --downsamplesizes 3 3 3 4 --folded --outgroup-fasta chr22_pan_example2_ref.fa --out vcf_to_sfs_test1.txt
This file has two functions that can be useful for working with site frequency spectra
The default script that runs when this file is run. This function can also be accessed directly by importing this file.
- generates an sfs from a vcf file
- can handle an arbitrary number of dimensions (populations) and sample sizes, so long as each population has at least a sample size of two genomes (i.e. one diploid individual)
- handles downsampling, and reduction of dimensions
- handles unfolded and folded sfs's
- can take a BED file name to sample portions of a vcf file
- can handle an alternative reference genome for rooting, rather than that used in the vcf file
- the arguments closely resemble those used when the function is called by running this file
The first three arguments are, in order, the vcf filename, the model file name, and the name of the model. These are each strings.
Each optional argument requires the use of the argument name.
- The name of a ucsc-style bedfile with intervals to include
- The name of a fasta sequence file that contains the reference genome
- True/False. To indicate that the folded sfs should be returned. (False is default)
- A list of sample sizes to be used if they are less than given in the model 2 <= downsamplesizes[i] <= samplesizes[i]
- The proportion of snps to include using random sampling
- A random number seed that can be used with randomsnpprop.
- Causes the array to be rounded to the nearest integer (dtype remains float)
- The name of a file to contain the sfs if out is not None, this will write a tab-delimited file of the array
Example python code:
1 2 3 4 5
import vcf_to_sfs as vs mysfs = vs.build_sfs(pan_example2.vcf.gz,panmodels.model,'4Pop', folded=True,downsamplesizes=[3,3,3,4], altreference='chr22_pan_example2_ref.fa', out = 'mysfsfile.txt')
This function is for reducing the dimensionality of an sfs by summing across axes. It is accessed by importing this file.
There are three required arguments in order:
- the sfs (i.e. a numpy array with as many dimensions as populations)
- an instance of Class model that specifies the populations and samples to which the sfs corresponds.
- a list of names of the populations to keep in the reduced sfs
There is one optional argument, 'out', if the reduced sfs is to be written to a file (e.g. out=mysfs.txt)
Example python code:
1 2 3
import vcf_to_sfs as vs myreducedsfs = vs.reduce_sfs_dim(mysfs,mypopmodel, ['A','B','C'],out="myreducedsfs_file.txt")