vcf_bed_to_seq.py: Generate sequences from VCF/BED Files

For generating a file with reconstituted sequences for regions of a vcf file.

Given a vcf file, a corresponding fasta reference file, a population model, and a specific interval, this script will return a file containing reconstituted sequences - two for each diploid individual.

This script can be run in stand-alone mode, or for more flexibility it can be imported to give access to more general functions for generating reconstituted sequences.

Required Arguments

--vcf <input_vcf_filename>
The name of the vcf file. This can be a bgzipped vcf file. .
--model-file <model_file_name>
The name of a PPP model file.
--modelname <model_name>
The name of a model in the model file.
--fasta-reference <reference fasta file>
The reference genome fasta file is required in order to generate full sequences from the SNP data in the vcf file.
--region <region string>
The region of the reference to return sequences from. Format: chromosome name,colon (:),first base number,dash(-),last base number. e.g. chr1:100392-101391
--out <out file name>
  • the name of an output file. If --out is omitted the default is ppp_vcf_to_sequences.out

Optional Aguments

--return-single <True/False>
If true, only a single sequence is returned for each individual in the model. The sequence for a given individual uses the first allele given for that individual for each SNP in the vcf file.

Example usage

Example command-lines:

vcf_bed_to_seq.py -h
vcf_bed_to_seq.py --vcf pan_example.vcf.gz --fasta-reference pan_example_ref.fa --model-file panmodels.model --modelname 4Pop --region 21:4431001-4499000  --out vcf_bed_to_seq_test.out

Importing Functions

This file has two functions that can be useful for getting lists of reconstituted sequences from a vcf file and a fasta reference file.

get_model_sequences_from_region()

Returns a list of sequences for a region in a vcf file and a samtools/pysam style region string.

Required Arguments

Each argument requires the use of the argument name.
vcf
Either a vcf_reader (i.e. see vf.VcfReader()) or the name of a vcf file (can be bgzipped)
popmodel
An instance of Class model. The individuals in the model must also be in the vcf file
seq_reference
A string that either contains the DNA sequence for the region or is a string containing the name of a fasta file. If a fasta file name, the chromosome name(s) in the fasta file must match those in the vcf file
region
Either a samtools/pysam style region string ("chromosome name:start-end") where the chromosome name matches that used in the vcf file where start and end are the 1-based endpoints (closed interval) Or an instance of class Region (Regions uses 0-based open interval on the left)
return_single
If True, return only the first sequence for an individual else, return two sequences (default False)

Optional arguments

Each optional argument requires the use of the argument name.
return_single
If True, return only the first sequence for an individual else, return two sequences (default False)
out
The name of a file to which the sequences are to be written (or appended if the file exists).

Example usage

Example python code

1
2
3
4
5
import vcf_bed_to_seq as vbs
myregion,sequences = vbs.get_model_sequences_from_region(vcf="pan_example.vcf.gz",
          popmodel=mymodel,seq_reference="pan_example_ref.fa",
          region="21:4431001-4499000",return_single=False,
          out = "chr21regionsequences.out")

get_model_sequences()

Returns a generator for getting sets of sequences from regions given in a BED file for individuals in a model. Each call to next() returns a list of sequences for the next region in the BED file

Arguments

Each argument requires the use of the argument name. All arguments are required with the exception taht either BED_filename or region_string (but not both) must be used.
vcf
the name of the vcf file (can be bgzipped)
model_file
name of a model file
modelname
The name of a model in model_file Either both model_file and modelname must be used, or popmodel must be used
popmodel
An instance of class model. The individuals in the model must also be in the vcf file Either popmmodel or both model_file and modelname must be used
fasta_reference
A fasta file with one more sequences corresponding to the vcf file typically these are reference chromosome sequences the chromosome name(s) in the fasta file must match those in the vcf file
BED_filename
A sorted BED file giving regions from which to pull sequences the first column with the chromosome name must match a name in the fasta_reference file The chromosome names in the BED file must match those in the fasta file and the vcf file
region_string
A pysam style region string, if only one region is to be returned

Example usage

Example python code:

1
2
3
4
5
6
7
8
9
a = get_model_sequences(vcf="pan_example.vcf.gz",
            popmodel=mymodel,fasta_reference="pan_example_ref.fa",
            BED_filename = "pan_example_regions.bed")
while True:
    try:
        s = next(a)# Will raise StopIteration if lines is exhausted
        print(len(s),len(s[0]))
    except StopIteration:
        break   # end loop