bed_utilities.py: BED Utilities

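Automates various utilities for BED-formatted files. These currently include: i) generating a BED file of windows from a chromosome size file; ii) sampling features from a BED file; iii) subtracting features from a BED file that overlap with features in a second BED file; iv) extending features upstream, downstream, or both upstream and downstream; v) sorting a single BED file; vi) merging features within one or more BED files; vii) creating a BED file of complementary features; viii) keeping only the features that intersect a second file.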

Command-line Usage

The BED utilities function may be called using the following command:

bed_utilities.py

Utilities

Windows Utility

Given a chromosome size file and a window size, the windows utility will generate a BED file of interval features.

Example usage

Return a BED with interval features that do not extend outside the chromosomes:

bed_utilities.py --utility windows --chrom-file hg18.chrom.sizes --window-size 1000 --out hg18_windows.bed
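
The window generation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by bed_utilities.py: the function name make_windows is hypothetical, and it assumes that the final window on each chromosome is truncated at the chromosome end (the utility may instead drop it). Coordinates follow the BED convention of 0-based, half-open intervals.

```python
def make_windows(chrom_sizes, window_size):
    """Yield (chrom, start, end) windows that never extend past a chromosome end.

    Hypothetical helper; assumes the last window is truncated rather than dropped.
    """
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window_size):
            yield chrom, start, min(start + window_size, size)


windows = list(make_windows({"chr1": 2500}, 1000))
print(windows)
# [('chr1', 0, 1000), ('chr1', 1000, 2000), ('chr1', 2000, 2500)]
```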

Sample Utility

Given a BED file and a sample size, the sample utility will generate a pseudorandomly sampled BED. Please note that the random seed may be used to reproduce the sample.

Example usage

Sample 20 features from a BED file:

bed_utilities.py --utility sample --bed examples/files/chr1_sites.bed --sample-size 20
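
To illustrate why a fixed seed makes the sample reproducible, here is a minimal sketch using Python's standard random module. The function name sample_features is hypothetical, and whether bed_utilities.py uses the same generator or preserves the original feature order is not stated in this documentation.

```python
import random


def sample_features(features, sample_size, seed=None):
    """Return sample_size features drawn without replacement.

    Hypothetical helper; a fixed seed yields the same sample on every call.
    """
    rng = random.Random(seed)  # private generator, does not touch global state
    return rng.sample(features, sample_size)


# Toy input: 100 features of 50 bp each on chr1.
features = [("chr1", i * 100, i * 100 + 50) for i in range(100)]
first = sample_features(features, 20, seed=42)
second = sample_features(features, 20, seed=42)
print(first == second)  # True: same seed, same sample
```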

Sort Utility

Given an unsorted BED file, the sort utility will generate a sorted BED file.

Example usage

Sort an unsorted BED file:

bed_utilities.py --utility sort --bed examples/files/chr1_sites.unsorted.bed
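
As a rough sketch of what sorting a BED file involves, the snippet below orders features by chromosome name, then start, then end. The function name sort_features is hypothetical, and the lexicographic chromosome order used here (where chr10 sorts before chr2) is an assumption; the utility may order chromosomes differently.

```python
def sort_features(features):
    """Sort (chrom, start, end) tuples by chromosome name, then start, then end.

    Hypothetical helper; chromosome names are compared as plain strings.
    """
    return sorted(features, key=lambda f: (f[0], f[1], f[2]))


unsorted_features = [("chr2", 5, 10), ("chr1", 50, 60), ("chr1", 5, 20)]
print(sort_features(unsorted_features))
# [('chr1', 5, 20), ('chr1', 50, 60), ('chr2', 5, 10)]
```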

Extend Utility

Given a BED file and an extend length, the extend utility will increase the length of each feature upstream, downstream, or both upstream and downstream.

Example usage

Extend upstream by 1kb:

bed_utilities.py --utility extend --bed examples/files/chr1_sites.bed --chrom-file examples/files/chr_sizes.txt --extend-upstream 1000

Extend downstream by 1kb:

bed_utilities.py --utility extend --bed examples/files/chr1_sites.bed --chrom-file examples/files/chr_sizes.txt --extend-downstream 1000

Extend flanks (i.e. both upstream and downstream) by 1kb:

bed_utilities.py --utility extend --bed examples/files/chr1_sites.bed --chrom-file examples/files/chr_sizes.txt --extend-flanks 1000
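
The three extend modes can be sketched with a single function. This is an illustration under stated assumptions, not the utility's implementation: the name extend_feature is hypothetical, "upstream" is taken to mean toward lower coordinates (strand is ignored), and extended features are clamped to the chromosome bounds, which would explain why the examples pass --chrom-file.

```python
def extend_feature(start, end, chrom_size, upstream=0, downstream=0):
    """Extend a 0-based, half-open interval and clamp it to [0, chrom_size].

    Hypothetical helper; assumes upstream means lower coordinates.
    """
    new_start = max(0, start - upstream)
    new_end = min(chrom_size, end + downstream)
    return new_start, new_end


# A feature at 500-800 on a 2000 bp chromosome, extended by 1 kb.
print(extend_feature(500, 800, 2000, upstream=1000))                   # (0, 800)
print(extend_feature(500, 800, 2000, downstream=1000))                 # (500, 1800)
print(extend_feature(500, 800, 2000, upstream=1000, downstream=1000))  # (0, 1800)
```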

Subtract Utility

Given two BED files, the subtract utility will remove BED features from a BED file if they overlap with the features from a second BED file.

Example usage

Remove BED features if they overlap features within the subtract BED file:

bed_utilities.py --utility subtract --bed examples/files/chr1_sites.bed --subtract-bed examples/files/chr1_sites.1.bed --subtract-entire-feature
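
Removing entire features that overlap a second set can be sketched as a simple filter. The helper names below are hypothetical, and the quadratic scan is chosen for clarity rather than speed; it is not how bed_utilities.py necessarily performs the subtraction.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def subtract_entire_features(features, subtract_features):
    """Keep only the features that overlap none of the subtract features."""
    return [f for f in features
            if not any(overlaps(f, s) for s in subtract_features)]


features = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 100)]
to_subtract = [("chr1", 250, 400)]
print(subtract_entire_features(features, to_subtract))
# [('chr1', 0, 100), ('chr2', 0, 100)]
```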

Complement Utility

Given a BED file, the complement utility will generate a BED file of complementary features.

Example usage

Return a BED with features that do not overlap the features within the given file:

bed_utilities.py --utility complement --bed examples/files/chr1_sites.bed --chrom-file examples/files/chr_sizes.txt
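
Complementary features are the gaps between the input features on each chromosome, including any stretch before the first feature and after the last one. The sketch below shows one way to compute them; the function name complement is hypothetical, and features on chromosomes missing from the size table are silently ignored here, which may differ from the utility's behavior.

```python
from collections import defaultdict


def complement(features, chrom_sizes):
    """Return the intervals of each chromosome not covered by any feature.

    Hypothetical helper; intervals are 0-based and half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in features:
        by_chrom[chrom].append((start, end))

    gaps = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # end of the region covered so far
        for start, end in sorted(by_chrom[chrom]):
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


features = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600)]
print(complement(features, {"chr1": 1000}))
# [('chr1', 0, 100), ('chr1', 300, 500), ('chr1', 600, 1000)]
```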

Intersect Utility

Given a BED file and an intersect file, return only the interval features within the BED file that overlap with the intersect file.

Example usage

Return a BED with only intersecting interval features:

bed_utilities.py --utility intersect --bed hg18_windows.bed --intersect-file Intersect.vcf.gz --out hg18_intersects.bed
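
When the intersect file is a VCF, the coordinate systems differ: VCF positions are 1-based, while BED intervals are 0-based and half-open. The sketch below keeps the windows that contain at least one variant position. It is an illustration only: the function name is hypothetical, variants are treated as single-base positions (the length of REF alleles is ignored), and no VCF parsing is shown.

```python
def intersect_with_positions(features, variants):
    """Keep BED features that contain at least one 1-based variant position.

    Hypothetical helper; variants is a list of (chrom, pos) pairs as in a VCF.
    A 1-based pos lies in the half-open BED interval [start, end)
    exactly when start < pos <= end.
    """
    return [
        (chrom, start, end)
        for chrom, start, end in features
        if any(v_chrom == chrom and start < pos <= end
               for v_chrom, pos in variants)
    ]


windows = [("chr1", 0, 1000), ("chr1", 1000, 2000), ("chr1", 2000, 3000)]
variants = [("chr1", 1000), ("chr1", 2001)]
print(intersect_with_positions(windows, variants))
# [('chr1', 0, 1000), ('chr1', 2000, 3000)]
```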

Merge Utility

Given one or more BED files, the merge utility will generate a single sorted BED file of merged BED features.

Example usage

Merge BED features from a single BED file:

bed_utilities.py --utility merge --bed examples/files/chr1_sites.bed

Merge BED features from multiple BED files:

bed_utilities.py --utility merge --beds examples/files/chr1_sites.1.bed examples/files/chr1_sites.2.bed examples/files/chr1_sites.3.bed examples/files/chr1_sites.4.bed
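
Merging can be sketched as pooling the features from every input, sorting them, and then joining neighbors on the same chromosome. The function name merge_features is hypothetical. The max_distance parameter mirrors the --max-merge-distance argument, but two details are assumptions: the gap is measured as the next start minus the previous end, and the default of 0 joins overlapping and directly adjacent features.

```python
def merge_features(feature_lists, max_distance=0):
    """Pool, sort, and merge (chrom, start, end) features from several lists.

    Hypothetical helper; features on the same chromosome are joined when the
    gap between them is at most max_distance bases.
    """
    pooled = sorted(f for features in feature_lists for f in features)
    merged = []
    for chrom, start, end in pooled:
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_distance:
            _, prev_start, prev_end = merged[-1]
            merged[-1] = (chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


bed_1 = [("chr1", 0, 100), ("chr1", 300, 400)]
bed_2 = [("chr1", 50, 150), ("chr2", 0, 10)]
print(merge_features([bed_1, bed_2]))
# [('chr1', 0, 150), ('chr1', 300, 400), ('chr2', 0, 10)]
print(merge_features([bed_1, bed_2], max_distance=150))
# [('chr1', 0, 400), ('chr2', 0, 10)]
```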

Dependencies

Input Command-line Arguments

--bed <input_filename>
Argument used to define the filename of the BED file.
--beds <input1_filename> <input2_filename> ...
Argument used to define the filenames of one or more BED files. May be used multiple times.
--chrom-file <chrom_filename>

Argument used to define the filename of a file with the sizes of each chromosome. Chromosome size files must be tab-delimited as follows:

chr1        247249719
chr2        242951149
...
chrX        154913754
chrY        57772954

Appropriate files may be downloaded from the UCSC Genome Browser. The ASSEMBLY.chrom.sizes file for each supported assembly may be found by clicking "Genome sequence files and select annotations" (followed by "Standard genome sequence files and select annotations" on some assemblies).
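
A chromosome size file in this format can be read into a dictionary with a few lines of Python. This is a minimal sketch; the function name read_chrom_sizes is hypothetical, and splitting on any whitespace (rather than tabs only) is a leniency chosen here, not a documented behavior of bed_utilities.py.

```python
def read_chrom_sizes(lines):
    """Parse 'chrom<TAB>size' lines into a {chrom: size} dictionary.

    Hypothetical helper; blank lines are skipped and extra columns ignored.
    """
    sizes = {}
    for line in lines:
        fields = line.split()  # splits on tabs or any other whitespace
        if not fields:
            continue
        sizes[fields[0]] = int(fields[1])
    return sizes


example_lines = ["chr1\t247249719\n", "chr2\t242951149\n", "\n"]
print(read_chrom_sizes(example_lines))
# {'chr1': 247249719, 'chr2': 242951149}
```

In practice the function could be given an open file handle, for example read_chrom_sizes(open("hg18.chrom.sizes")), since iterating over a file yields its lines.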

Output Command-line Arguments

--out <output_filename>
Argument used to define the complete output filename.
--overwrite
Argument used to define if previous output should be overwritten.

Utility Command-line Specification

--utility <windows, sample, subtract, extend, sort, merge, complement, intersect>
Argument used to define the desired utility. Current utilities include: generate a BED file of windows from a chromosome size file (windows); sample features from a BED file (sample); subtract features from a BED file that overlap with features within a second BED file (subtract); extend the flanks of features upstream, downstream, or both within a single BED file (extend); sort the features within a single BED file (sort); merge features within one or more BED files (merge); create a BED file of complementary features, i.e. features that do not overlap the features of a BED file (complement); keep only the features within a BED file that overlap with an intersect file (intersect).

Windows Utility Command-line Arguments

--window-size <window_size_int>
Argument used to define the window/interval size to return.

Sample Utility Command-line Arguments

--sample-size <sample_size_int>
Argument used to define the total sample size.
--random-seed <seed_int>
Argument used to define the seed value for the random number generator.

Subtract Utility Command-line Arguments

--subtract-bed <subtract_file_filename>
Argument used to define the BED file used for removing features/positions.
--subtract-entire-feature
Argument used to define if entire features within the input BED should be removed if they overlap with features in subtract-bed. If --min-input-overlap or --min-subtract-overlap is given, features are removed from the input BED once that minimum overlap is reached.
--min-reciprocal-overlap <overlap_float>
Argument used to define the minimum reciprocal overlap of features required for removal (e.g. 0.1 indicates 10% overlap).
--min-input-overlap <overlap_float>
Argument used to define the minimum overlap of input features required for removal.
--min-subtract-overlap <overlap_float>
Argument used to define the minimum overlap of subtract-bed features required for removal.
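
The three overlap thresholds can be illustrated by computing what fraction of each feature is covered by the overlap. The helper names below are hypothetical. The reading of "reciprocal" used here, in which both fractions must reach the threshold, follows the common convention in interval tools and is an assumption about how bed_utilities.py applies --min-reciprocal-overlap.

```python
def overlap_fractions(in_start, in_end, sub_start, sub_end):
    """Return (fraction of input feature, fraction of subtract feature) overlapped.

    Hypothetical helper; both intervals are half-open and on the same chromosome.
    """
    overlap = max(0, min(in_end, sub_end) - max(in_start, sub_start))
    return overlap / (in_end - in_start), overlap / (sub_end - sub_start)


def meets_reciprocal_overlap(in_start, in_end, sub_start, sub_end, min_fraction):
    """True if BOTH features are overlapped by at least min_fraction (assumed meaning)."""
    input_frac, subtract_frac = overlap_fractions(in_start, in_end, sub_start, sub_end)
    return min(input_frac, subtract_frac) >= min_fraction


# Input feature 0-100 and subtract feature 50-250 share 50 bases.
print(overlap_fractions(0, 100, 50, 250))               # (0.5, 0.25)
print(meets_reciprocal_overlap(0, 100, 50, 250, 0.25))  # True
print(meets_reciprocal_overlap(0, 100, 50, 250, 0.3))   # False
```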

Extend Utility Command-line Arguments

--extend-flanks <bp_int>
Argument used to define the number of base pairs (bp) by which to extend features both upstream and downstream.
--extend-upstream <bp_int>
Argument used to define the number of base pairs (bp) by which to extend features upstream.
--extend-downstream <bp_int>
Argument used to define the number of base pairs (bp) by which to extend features downstream.

Intersect Utility Command-line Arguments

--intersect-file <intersect_file_filename>
Argument used to define the BED/VCF/VCF.gz file used to remove features that do not intersect with the given file's features/variants.

Merge Utility Command-line Arguments

--max-merge-distance <bp_int>
Argument used to define the maximum distance allowed between features to be merged.