Data Configuration

The next step in the data extraction process is data configuration. The data set must be put into the form required by the software used for extraction. This commonly entails removing unnecessary columns and empty entries.
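As a rough illustration of that kind of cleanup, the sketch below drops an unneeded column and removes rows with missing entries in R. The dataframe and its column names (id, label, notes) are invented for the example and are not taken from the case study.

```r
# Toy dataframe with hypothetical column names, for illustration only
raw <- data.frame(
  id = c(1, 2, 3, 4),
  label = c("a", NA, "b", "c"),
  notes = c("x", "y", "z", "w"),
  stringsAsFactors = FALSE
)

# Keep only the columns needed downstream (this drops `notes`)
trimmed <- raw[, c("id", "label")]

# Remove every row that contains at least one missing (NA) entry
cleaned <- na.omit(trimmed)
```

Note that empty text fields read from a CSV arrive as empty strings rather than NA unless na.strings = c("", "NA") is passed to read.csv(), so that option may be needed before na.omit() can catch them.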

For the case study, the data set is provided as a CSV file. When read into R, it becomes a dataframe with four variables: the ID of the image, the slide on which the image is found, whether or not the image contains cancer, and the ID of the image within its slide.

The cancer classification variable is stored as a character variable in the dataframe. For classification, it must be stored as a numeric variable, so a numeric version is added to the dataframe.


Code for converting the CSV file to a dataframe in R
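A minimal sketch of this step, under stated assumptions: the file path, the column names (image_id, slide, cancer, slide_image_id), and the sample rows are all hypothetical stand-ins, since the actual file and its headers are not given here. The sketch writes a small CSV to a temporary location only so that it can run on its own.

```r
# Hypothetical stand-in file, written to a temp path so the sketch is runnable;
# in practice, point read.csv() at the real CSV file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "image_id,slide,cancer,slide_image_id",
  "1,A,yes,1",
  "2,A,no,2",
  "3,B,yes,1"
), csv_path)

# read.csv() returns a data.frame; stringsAsFactors = FALSE keeps
# text columns as character rather than converting them to factors
images <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure: four variables, with `cancer` stored as character
str(images)
```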


Code to add a numeric variable for cancer classification
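One way this conversion might look, assuming the character variable holds the labels "yes" and "no" (the actual label values are not stated here) and using hypothetical column names cancer and cancer_num:

```r
# Toy dataframe standing in for the case-study data
# (column names and label values are assumptions)
images <- data.frame(
  image_id = 1:3,
  cancer = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Add a numeric column: 1 where the label is "yes", 0 for any other label
images$cancer_num <- ifelse(images$cancer == "yes", 1, 0)
```

Because every label other than "yes" maps to 0 in this sketch, a quick cross-tabulation such as table(images$cancer, images$cancer_num) is a reasonable check that no misspelled labels were silently coded as 0.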