GDD - Genomic Diversity Distribution
Introduction
GDD is a tool for variant analysis written in Python. It uses multiple libraries, such as pandas, NumPy, scikit-allel, and others. The description and parameters of each function are given below.
Available Functions
attribute_guide
Usage:
attribute_guide(attribute_dictionary)
Description:
Prints the attribute name, attribute description, and value type of each attribute available in the INFO and FORMAT fields of a VCF file. The parameter is the following:
attribute_dictionary - dictionary extracted from the VCF file with the extract_attributes() function.
ct_bar_plots
Usage:
ct_bar_plots(samples, vcf_dataframe, chrom_len_dataframe, window_size)
Description:
Plots a genotypic contingency histogram of two samples, using color to depict genotypes along each chromosome. To read the plot, use the contingency table guide printed by the ct_guide() function. The parameters are:
samples - list of the two samples being compared.
vcf_dataframe - pandas dataframe containing the GT_sampleA and GT_sampleB fields.
chrom_len_dataframe - pandas dataframe containing the chromosome names and their respective lengths.
window_size - the length in base pairs (bp) of the window to be analyzed.
Below is an example of the output.
Figure: contingency bar genotypes plots for samples ‘PAHAT_1’ and ‘GHP-2-2’ for 3 chromosomes
ct_guide
Usage:
ct_guide()
Description:
Prints a table of letters that serves as a guide for the contingency table bar plots produced by the ct_bar_plots() function described above. The table is shown below.
Figure: contingency table guide for the ct_bar_plots function
ct_table
Usage:
ct_table(samples, vcf_dataframe, chromosome_name)
Description:
Returns a contingency table of two samples in table form. The parameters are the following:
samples - list of two samples.
vcf_dataframe - pandas dataframe containing the genotypes (GT) of the samples extracted from the VCF file; the dataframe may already have been filtered.
chromosome_name - a string holding the name of the chromosome being compared; to print the contingency table of a specific chromosome, the vcf_dataframe first needs to be filtered for that chromosome; to print the contingency table of all chromosomes, pass 'all'.
Here is an example:
# Keep only the Chr12 variants, then build their contingency table
chromosome = 'Chr12'
chromosome_filter = 'CHROM == ' + chromosome
chromosome_dataframe = extract_df_data(vcf_dataframe, chromosome_filter)
ct_table(samples, chromosome_dataframe, chromosome)
The output for chromosome Chr12 has the following layout:
Contingency Table - Chromosome Chr12

                    SampleB
SampleA | 0/0 | 0/1 | 1/1 | other
0/0     | #   | #   | #   |
0/1     | #   | #   | #   |
1/1     | #   | #   | #   |
other   | #   |     |     |
Here is an actual output. Note that all the numbers inside other are the same: each one is the total number of genotypes that are not 0/0, 0/1, or 1/1.
Figure: contingency table of genotypes for samples ‘PAHAT_1’ and ‘GHP-2-2’ for all chromosomes
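The counting behind such a table can be sketched in plain Python. This is only an illustration under stated assumptions, not GDD's pandas-based implementation: the helper names categorize and contingency_counts are hypothetical, the genotype lists are made-up data, and the sketch keeps one cell per genotype pair while also deriving a single "other" total, assumed here to count loci where at least one sample has a genotype outside 0/0, 0/1, and 1/1.

```python
from collections import Counter

CATEGORIES = ["0/0", "0/1", "1/1", "other"]

def categorize(genotype):
    """Map any genotype outside 0/0, 0/1, 1/1 to the 'other' bucket."""
    return genotype if genotype in CATEGORIES[:3] else "other"

def contingency_counts(genotypes_a, genotypes_b):
    """Count genotype pairs (sample A, sample B) locus by locus."""
    return Counter(
        (categorize(a), categorize(b)) for a, b in zip(genotypes_a, genotypes_b)
    )

# Made-up genotypes for two hypothetical samples at six loci.
sample_a = ["0/0", "0/1", "1/1", "0/1", "1/2", "0/0"]
sample_b = ["0/0", "1/1", "1/1", "0/1", "0/0", "0/2"]
counts = contingency_counts(sample_a, sample_b)

# Assumption: the single 'other' total counts loci where either sample is 'other'.
other_total = sum(n for pair, n in counts.items() if "other" in pair)

# Rows are sample A genotypes, columns are sample B genotypes.
print("A\\B".ljust(6) + "".join(c.ljust(6) for c in CATEGORIES))
for row in CATEGORIES:
    print(row.ljust(6) + "".join(str(counts[(row, col)]).ljust(6) for col in CATEGORIES))
print("other total:", other_total)
```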
extract_attributes
Usage:
attributes = extract_attributes(vcf_file)
Description:
Returns a dictionary of the available attributes/fields in the VCF file. The available attributes are printed, giving the user the option to request attributes beyond the mandatory ones (which are marked in the printed output) when calling the vcf_to_table() function. The parameter is:
vcf_file- path/name to VCF file.
The output is:
attributes- dictionary containing the available attributes/fields in the VCF file.
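For orientation, the INFO and FORMAT attributes of a VCF file are declared in its ##INFO=<...> and ##FORMAT=<...> header lines. The sketch below collects those declarations into a dictionary using only the standard library. The function name parse_vcf_header_attributes and the dictionary layout are assumptions made for illustration; the structure actually returned by extract_attributes() is not specified here and may differ.

```python
import re

# Matches header declarations such as:
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
HEADER_RE = re.compile(
    r'^##(INFO|FORMAT)=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)"'
)

def parse_vcf_header_attributes(lines):
    """Collect INFO/FORMAT declarations from VCF header lines (hypothetical helper)."""
    attributes = {}
    for line in lines:
        match = HEADER_RE.match(line.strip())
        if match:
            field, attr_id, number, value_type, description = match.groups()
            attributes[(field, attr_id)] = {
                "number": number,
                "type": value_type,
                "description": description,
            }
    return attributes

# A tiny made-up header used only to exercise the helper.
header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA',
]
attributes = parse_vcf_header_attributes(header)
for (field, attr_id), meta in attributes.items():
    print(field, attr_id, meta["type"], meta["description"])
```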
extract_df_data
Usage:
filtered_vcf_dataframe = extract_df_data(vcf_dataframe, filter_list)
Description:
Returns a filtered vcf_dataframe. The filters are passed through the filter_list parameter. The parameters are:
vcf_dataframe - pandas dataframe containing all the samples’ attributes to be analyzed.
filter_list - string of comparisons separated by commas, each written with a comparison operator; for example, to keep only SNP variants while also restricting depth/coverage, chromosome, a specific genotype, and quality:
filter_list = "TYPE == snp, DP >= 10, DP <= 100, CHROM != mitochondria, GT_mutant == 1/1, QUAL > 1000"
filtered_vcf_dataframe = extract_df_data(vcf_dataframe, filter_list)
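To make the filter string format concrete, here is a minimal plain-Python sketch of how such a comma-separated list of comparisons could be parsed and applied. It is not GDD's implementation: the helpers parse_filter_list and apply_filters are hypothetical, records are plain dictionaries rather than a pandas dataframe, each comparison is assumed to be written as FIELD OPERATOR VALUE separated by spaces, and the records are made-up data.

```python
import operator

OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}

def parse_filter_list(filter_list):
    """Split 'FIELD OP VALUE, ...' into (field, compare_function, value) tuples."""
    conditions = []
    for clause in filter_list.split(","):
        field, symbol, value = clause.split()
        try:
            value = float(value)  # compare numerically when the value is a number
        except ValueError:
            pass  # keep strings such as 'snp' or '1/1' unchanged
        conditions.append((field, OPERATORS[symbol], value))
    return conditions

def apply_filters(records, filter_list):
    """Keep the records that satisfy every comparison in the filter string."""
    conditions = parse_filter_list(filter_list)
    return [r for r in records if all(op(r[f], v) for f, op, v in conditions)]

# Made-up variant records standing in for dataframe rows.
records = [
    {"CHROM": "Chr1", "TYPE": "snp", "DP": 35, "QUAL": 2500.0},
    {"CHROM": "Chr1", "TYPE": "indel", "DP": 40, "QUAL": 3000.0},
    {"CHROM": "Chr2", "TYPE": "snp", "DP": 5, "QUAL": 1800.0},
    {"CHROM": "mitochondria", "TYPE": "snp", "DP": 60, "QUAL": 5000.0},
]
kept = apply_filters(
    records, "TYPE == snp, DP >= 10, DP <= 100, CHROM != mitochondria, QUAL > 1000"
)
print(kept)
```

All comparisons are combined with a logical AND, which matches the way the example above narrows the variants down step by step.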
filter_sim_gt
Usage:
filtered_vcf_dataframe = filter_sim_gt(samples, vcf_dataframe, genotype_list)
Description:
Returns a dataframe from which variants where the two samples share the same genotype (i.e., not real variants) have been filtered out. The parameters are:
samples - list of the 2 samples being analyzed.
vcf_dataframe - pandas dataframe containing the GT_sampleName field per sample.
genotype_list - list of genotypes to be filtered out; normally, loci where both samples have 0/0, 1/1, or other genotypes that are not 0’s or 1’s are filtered out; these other genotypes could be 0/2, 1/3, 1/2, etc.
Example:
genotypes = ['0/0', '1/1']
filtered_vcf_dataframe = filter_sim_gt(samples, vcf_dataframe, genotypes)
ct_table(samples, filtered_vcf_dataframe, 'all')
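The filtering rule can be sketched in plain Python as follows. This is an illustration under assumptions, not GDD's code: the helper drop_shared_genotypes is hypothetical, rows are plain dictionaries, the genotype columns are assumed to be named GT_ followed by the sample name, and a locus is assumed to be dropped only when both samples have the same genotype and that genotype appears in the list.

```python
def drop_shared_genotypes(records, samples, genotype_list):
    """Drop loci where both samples carry the same genotype and it is in genotype_list."""
    key_a, key_b = (f"GT_{name}" for name in samples)
    return [
        r for r in records
        if not (r[key_a] == r[key_b] and r[key_a] in genotype_list)
    ]

# Made-up loci for two hypothetical samples.
samples = ["sampleA", "sampleB"]
records = [
    {"POS": 100, "GT_sampleA": "0/0", "GT_sampleB": "0/0"},  # shared 0/0: dropped
    {"POS": 200, "GT_sampleA": "0/1", "GT_sampleB": "1/1"},  # differs: kept
    {"POS": 300, "GT_sampleA": "1/1", "GT_sampleB": "1/1"},  # shared 1/1: dropped
    {"POS": 400, "GT_sampleA": "0/1", "GT_sampleB": "0/1"},  # shared, not listed: kept
]
kept = drop_shared_genotypes(records, samples, ["0/0", "1/1"])
print([r["POS"] for r in kept])
```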
gt_bar_plots
Usage:
gt_bar_plots(samples, vcf_dataframe, chrom_len_dataframe, window_size)
Description:
Plots genotypic histograms of two samples, depicting by color the 0/0, 0/1, or 1/1 genotypes along each chromosome per sample. The parameters are:
samples - list of the two samples being compared.
vcf_dataframe - pandas dataframe containing the GT_sampleA and GT_sampleB fields.
chrom_len_dataframe - pandas dataframe containing the chromosome names and their respective lengths.
window_size - the length in base pairs (bp) of the window to be analyzed.
An example figure is shown below.
Figure: genotype bar plots for samples ‘PAHAT_1’ and ‘GHP-2-2’ for 3 chromosomes
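The role of window_size can be illustrated with a short plain-Python sketch that groups variant positions into fixed-size windows and counts genotypes in each one. It is a sketch under assumptions rather than GDD's implementation: the helper window_genotype_counts is hypothetical, the positions and genotypes are made-up data for a single sample on a single chromosome, and window i is assumed to cover positions from i * window_size up to, but not including, (i + 1) * window_size.

```python
from collections import Counter, defaultdict

def window_genotype_counts(positions, genotypes, window_size):
    """Count genotypes per fixed-size window along one chromosome."""
    windows = defaultdict(Counter)
    for pos, gt in zip(positions, genotypes):
        windows[pos // window_size][gt] += 1
    return windows

# Made-up variant positions (bp) and genotypes for one sample.
positions = [120, 950, 1500, 1800, 2100, 4700]
genotypes = ["0/0", "0/1", "0/1", "1/1", "0/1", "0/0"]
window_size = 1000
windows = window_genotype_counts(positions, genotypes, window_size)

for index in sorted(windows):
    start = index * window_size
    print(f"{start}-{start + window_size - 1}: {dict(windows[index])}")
```

A larger window_size yields fewer, coarser bars per chromosome, while a smaller one shows finer detail at the cost of sparser counts per window.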
gt_plot
Usage:
gt_plot(samples, vcf_dataframe, chrom_len_dataframe, linethickness=0.02)
Description:
Plots the 0/0, 0/1, and 1/1 genotypes of each chromosome per sample in one figure. The parameters are:
samples - list of samples to focus on from the VCF file.
vcf_dataframe - pandas dataframe containing the sampleName_GT field per sample.
chrom_len_dataframe - pandas dataframe containing the chromosome names and their respective lengths.
linethickness - defaults to 0.02; useful to control line thickness when only a few genotypes are present in the dataframe.
An example figure is shown below.
Figure: genotype plots for samples ‘PAHAT_1’ and ‘GHP-2-2’ for all chromosomes
gt_plots
Usage:
gt_plots(samples, vcf_dataframe, chrom_len_dataframe, linethickness=0.02)
Description:
Plots the 0/0, 0/1, and 1/1 genotypes per chromosome per sample. The parameters are:
samples - list of samples being analyzed.
vcf_dataframe - pandas dataframe containing the sampleName_GT field per sample.
chrom_len_dataframe - pandas dataframe containing the chromosome names and their respective lengths.
linethickness - defaults to 0.02; useful to control line thickness when only a few genotypes are present in the dataframe.
An example figure is shown below.
Figure: genotype plots for samples ‘PAHAT_1’ and ‘GHP-2-2’ for 3 chromosomes
variant_hist
Usage:
variant_hist(samples, vcf_dataframe, chromosome, attribute, bins=50, MSTD=False, xmin=0, xmax=0)
Description:
Plots a histogram of the specified attribute from the vcf_dataframe. The parameters are:
samples - list of samples; used only to build the plot title and the name of the saved plot; does not affect the histogram.
vcf_dataframe - pandas dataframe containing all the samples’ attributes to be analyzed.
chromosome - chromosome being analyzed; if analyzing variants from all chromosomes, insert all as the parameter; used only to build the plot title and the name of the saved plot; does not affect the histogram.
attribute - attribute being analyzed; options are: CHROM, POS, REF, ALT, QUAL, DP, sampleName_GT, sampleName_AD, sampleName_AN, TYPE.
bins - number of bins for the histogram grouping; defaults to 50.
MSTD - defaults to False; if True, the histogram includes the Mean and STandard Deviation (MSTD) values and shows vertical lines at one standard deviation on either side of the mean.
xmin - defaults to 0; if changed, it has no effect unless used in combination with the xmax parameter.
xmax - defaults to 0, which plots all the way to the maximum X value of the attribute; when greater than 0, the x-axis is limited to the given value.
Example figures for DP, TYPE, and PAHAT_1_GT are shown below.
Figure: DP histogram for ‘PAHAT_1’ for all chromosomes
Figure: TYPE of mutations histogram for all samples and all chromosomes
Figure: PAHAT_1_GT histogram of sample PAHAT_1 available genotypes in all chromosomes
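The values that the MSTD option refers to can be sketched with the standard library alone. The helper mean_std_lines is hypothetical and the depth values are made-up data; whether GDD uses the sample or the population standard deviation is not stated, so the sample version is assumed here.

```python
import statistics

def mean_std_lines(values):
    """Return the mean, the standard deviation, and the mean ± 1 std positions."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (an assumption)
    return mean, std, (mean - std, mean + std)

# Made-up DP (depth) values for a handful of variants.
depths = [10, 12, 14, 16, 18]
mean, std, (low, high) = mean_std_lines(depths)
print(f"mean={mean}, std={std:.3f}, lines at {low:.3f} and {high:.3f}")
```

In a plot, low and high would be the x-positions of the two vertical lines drawn around the mean.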
vcf_to_table
Usage:
samples, vcf_dataframe, chrom_len_dataframe = vcf_to_table(vcf_file)
Description:
Takes a VCF file as input, extracts its information, and returns 3 outputs. The parameter is:
vcf_file- path/name to VCF file.
The 3 outputs are the following:
samples - list of samples in the VCF file.
vcf_dataframe - pandas dataframe containing the fields specified at the prompt.
chrom_len_dataframe - pandas dataframe containing the chromosome names and their respective lengths.
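As background on where two of these outputs can come from, a VCF header lists sample names after the FORMAT column of the #CHROM line and may declare chromosome lengths in ##contig=<ID=...,length=...> lines. The sketch below reads both with the standard library. It is an assumption-laden illustration, not a description of how vcf_to_table() works internally: the helper read_header_summary is hypothetical, and the contig lengths in the example header are made up.

```python
import re

# Matches lines such as: ##contig=<ID=Chr1,length=1000,assembly=demo>
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+).*?length=(\d+)")

def read_header_summary(lines):
    """Return (sample names, {chromosome: length}) from VCF header lines."""
    samples, chrom_lengths = [], {}
    for line in lines:
        line = line.rstrip("\n")
        contig = CONTIG_RE.match(line)
        if contig:
            chrom_lengths[contig.group(1)] = int(contig.group(2))
        elif line.startswith("#CHROM"):
            # The first nine columns are fixed; sample names follow FORMAT.
            samples = line.split("\t")[9:]
            break
    return samples, chrom_lengths

# Made-up header; the lengths are illustrative only.
header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=Chr1,length=1000,assembly=demo>",
    "##contig=<ID=Chr2,length=2000>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPAHAT_1\tGHP-2-2",
]
samples, chrom_lengths = read_header_summary(header)
print(samples, chrom_lengths)
```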