stat_sampler.py: STAT File Sampler

As a single statistic file may include far more loci/windows than a technique is capable of analyzing, it is often necessary to sample the loci/windows from the file. Given a statistic file and a sampling scheme, stat_sampler will generate a pseudorandomly sampled file.

../../_images/PPP_STAT_Sample.png

In this illustration of the sampling process, the loci found within Data.VCF are pseudorandomly sampled using the coordinates found within the given statistic file.

Two pseudorandom sampling schemes are provided: i) a random sampler that randomly selects loci/windows and ii) a uniform sampler that samples evenly across equal-sized bins of the given statistic. Please note that all sampling is done without replacement.
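The two schemes can be sketched in a few lines of Python. This is only an illustration, not the stat_sampler source: the function names `random_sample` and `uniform_sample` are hypothetical, and "equal-sized bins" is assumed here to mean equal-width bins spanning the observed range of the statistic.

```python
import random


def random_sample(values, sample_size, seed=None):
    """Return sample_size row indices chosen without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(values)), sample_size))


def uniform_sample(values, sample_size, n_bins, seed=None):
    """Return row indices drawn evenly from n_bins equal-width bins.

    Assumption: bins are equal-width intervals over [min, max] of the
    statistic; the real tool may define its bins differently.
    """
    if sample_size % n_bins != 0:
        raise ValueError("sample size must be divisible by the number of bins")
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for idx, value in enumerate(values):
        # The maximum value is clamped into the last bin.
        b = min(int((value - lo) / width), n_bins - 1) if width > 0 else 0
        bins[b].append(idx)
    per_bin = sample_size // n_bins
    picked = []
    for members in bins:
        if len(members) < per_bin:
            raise ValueError("a bin holds fewer rows than requested per bin")
        # rng.sample draws without replacement within each bin.
        picked.extend(rng.sample(members, per_bin))
    return sorted(picked)
```

With 100 evenly spread values, four bins, and a sample size of 20, the sketch returns five distinct row indices from each quarter of the statistic's range.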

For BED-based sampling, please see ../Utilities/bed_utilities.rst.

Command-line Usage

The statistic sampler may be called using the following command:

stat_sampler.py

Example usage

Randomly sampling 20 windows from a windowed Fst statistic file merged_chr1_10000.windowed.weir.fst.

stat_sampler.py --statistic-file examples/files/merged_chr1_10000.windowed.weir.fst --calc-statistic windowed-weir-fst --sampling-scheme random --sample-size 20

Uniformly sampling 20 windows across four bins from a windowed pi statistic file merged_chr1_10000.windowed.pi.

stat_sampler.py --statistic-file examples/files/merged_chr1_10000.windowed.pi --calc-statistic window-pi --sampling-scheme uniform --uniform-bins 4 --sample-size 20
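When the sampler is driven from another script, the same commands can be assembled programmatically. The helper below, `build_stat_sampler_cmd`, is a hypothetical convenience function and not part of the package; it only arranges the flags documented on this page into an argument list, which could then be passed to `subprocess.run` if stat_sampler.py is on the PATH.

```python
import shlex


def build_stat_sampler_cmd(statistic_file, calc_statistic, sample_size,
                           sampling_scheme="random", uniform_bins=None,
                           random_seed=None):
    """Assemble a stat_sampler.py argument list (hypothetical helper)."""
    cmd = [
        "stat_sampler.py",
        "--statistic-file", statistic_file,
        "--calc-statistic", calc_statistic,
        "--sampling-scheme", sampling_scheme,
    ]
    # Optional flags are added only when a value is supplied.
    if uniform_bins is not None:
        cmd += ["--uniform-bins", str(uniform_bins)]
    cmd += ["--sample-size", str(sample_size)]
    if random_seed is not None:
        cmd += ["--random-seed", str(random_seed)]
    return cmd


# Rebuild the uniform sampling example shown above as a shell string.
uniform_cmd = build_stat_sampler_cmd(
    "examples/files/merged_chr1_10000.windowed.pi",
    "window-pi",
    20,
    sampling_scheme="uniform",
    uniform_bins=4,
)
print(shlex.join(uniform_cmd))
```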

Input Command-line Arguments

--statistic-file <statistic_filename>
Argument used to define the filename of the statistic file for sampling.

Output Command-line Arguments

--out <output_filename>
Argument used to define the complete output filename, overrides --out-prefix. Cannot be used if multiple output files are created.
--out-prefix <output_prefix>
Argument used to define the output prefix (i.e. filename without file extension).
--overwrite
Argument used to define if previous output should be overwritten.

Sampling Command-line Arguments

--calc-statistic <windowed-weir-fst, TajimaD, window-pi>
Argument used to define the statistic to be sampled. Supported statistics are windowed Fst (windowed-weir-fst), Tajima's D (TajimaD), and windowed nucleotide diversity (window-pi).
--sampling-scheme <random, uniform>
Argument used to define the sampling scheme: random sampling [Default] or uniform sampling across a number of equal-sized bins.
--uniform-bins <bin_int>
Argument used to define the number of bins in uniform sampling.
--sample-size <sample_size_int>
Argument used to define the total sample size. If using the uniform sampling scheme, this number must be divisible by the number of bins.
--random-seed <seed_int>
Argument used to define the seed value for the random number generator.
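Two of the constraints above lend themselves to a short illustration: the divisibility requirement linking --sample-size to --uniform-bins, and the effect of fixing --random-seed. The sketch below uses Python's standard `random` module as a stand-in; the function names `per_bin_size` and `draw` are hypothetical, and the generator used internally by stat_sampler may differ.

```python
import random


def per_bin_size(sample_size, uniform_bins):
    """Return how many rows each bin contributes under uniform sampling."""
    if sample_size % uniform_bins != 0:
        raise ValueError(
            f"sample size {sample_size} is not divisible by {uniform_bins} bins"
        )
    return sample_size // uniform_bins


def draw(n_rows, sample_size, seed):
    """Draw row indices without replacement from a seeded generator."""
    return random.Random(seed).sample(range(n_rows), sample_size)


# 20 samples over 4 bins means 5 rows per bin.
rows_per_bin = per_bin_size(20, 4)

# The same seed reproduces the same selection; a different seed
# generally gives a different one.
first = draw(1000, 20, seed=42)
second = draw(1000, 20, seed=42)
```

Fixing the seed is what makes a sampled file reproducible across runs, which matters when downstream analyses need to be repeated on the same loci/windows.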