Data Selection

Data selection is perhaps the most important part of the data extraction process. In order to extract meaningful information, a data set must be chosen that allows for accuracy to be tested.

For our case study, we will use Wouter Bulten's prostate histopathology image data set, which includes 160 prostate histopathology images, some of which are of cancerous areas and some of which are not. This set was chosen as an adequate data set for variable extraction and image classification due to its inclusion of whether or not an image corresponds to a cancerous region of the patient's prostate. This allows for classification methods to be tested easily on the data set. Since the cancerous images are known, the accuracy of predictions of cancer in the image based on extracted variables can easily be tested.


Cancerous Image From Data Set