Lesson 3: Data Preparation
Getting started: Cleaning the data
Once we understand the problem we are trying to solve and have identified and collected the data, we must clean it. We do this by identifying any quality issues: is there anything unusual about the data in the dataset?
Some examples of data quality issues are:
- Missing values: Values that display 0, blank, NA, or null values when we expect an actual value
- Irregular cardinality: Multiple words referring to the same value, for example: Male, male, M, m, man, men, boy.
- Outliers: Extremely high or low values that may be unexpected. These values can heavily skew any analysis of the dataset.
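Each of these issues can be spotted with a few quick checks. Here is a minimal sketch using pandas on a hypothetical toy dataset (the column names "age", "gender", and "salary" and their values are illustrative, not from the lesson):

```python
import numpy as np
import pandas as pd

# Toy dataset with one example of each quality issue.
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47, 29],                      # missing value
    "gender": ["Male", "male", "M", "F", "female"],          # irregular cardinality
    "salary": [42000, 39000, 41000, 1_200_000, 40500],       # outlier
})

# Missing values: count nulls per feature.
missing_counts = df.isnull().sum()

# Irregular cardinality: inspect the distinct labels for a feature.
gender_levels = df["gender"].unique()

# Outliers: summary statistics reveal extreme values.
salary_max = df["salary"].max()
```

Running `df.describe()` and `df["gender"].value_counts()` on a real dataset gives the same information for every feature at once.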
How we clean data
Multiple approaches are used to clean data.
Missing Values
The methods for handling missing data are:
- Check the % of missing values for a given column or feature.
- Drop features: Drop any features that have missing values.
- Drop instances: Drop any rows that contain missing values (complete case analysis).
- Derive missing indicators: Derive a missing indicator from features with missing values.
- Imputation
- Replace missing feature values with a plausible estimate
- The most common approach, if undecided, is to replace missing values with a measure of the feature's central tendency (e.g., mean, median, mode)
- However, this is not recommended for features missing more than 30% of their values
- If a feature/column in the dataset/data frame has more than 50% of its values missing, we should drop it, as imputing that many values would skew the data
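The handling strategies above can be sketched with pandas; the column names and the 30%/50% decisions follow the thresholds stated here, while the data itself is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [25, np.nan, 47, 29, 31, 33],                          # ~17% missing
    "notes": [np.nan, np.nan, "ok", np.nan, np.nan, np.nan],        # ~83% missing
})

# 1. Check the % of missing values per feature.
pct_missing = df.isnull().mean() * 100

# 2. Drop features with more than 50% of their values missing.
df = df.drop(columns=pct_missing[pct_missing > 50].index)

# 3. Derive a missing indicator before imputing, so the information
#    that a value was absent is not lost.
df["age_missing"] = df["age"].isnull().astype(int)

# 4. Impute the remaining gaps with a measure of central tendency (median).
df["age"] = df["age"].fillna(df["age"].median())
```

Deriving the indicator before imputation matters: once the gaps are filled, there is no way to tell imputed values from real ones.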
Irregular cardinality
The number of distinct or unique values present for a feature in a data set. Cardinality can be low or high:
- Low cardinality: Examples include categorical features incorrectly labelled as numeric (continuous) or vice-versa. For example, 1 to signify male, 0 to signify female.
- High cardinality: A feature with many distinct values. For example, gender having six levels because of inconsistent labelling, such as male, female, M, F, m, f.
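Inconsistent labels like the gender example above can be collapsed onto canonical levels with a simple mapping. A minimal sketch (the label variants and mapping are illustrative):

```python
import pandas as pd

# Six inconsistent labels for what should be two levels.
gender = pd.Series(["male", "Female", "M", "f", "MALE", "female"])

# Normalise case, then map each variant onto a canonical level.
canonical = {"male": "male", "m": "male", "female": "female", "f": "female"}
gender_clean = gender.str.lower().map(canonical)

n_levels = gender_clean.nunique()  # cardinality after cleaning
```

Any label not covered by the mapping becomes `NaN`, which conveniently surfaces unexpected values as a missing-value problem to be handled next.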
Outliers
Extreme high or low feature values in a data set. These can be the result of invalid data (e.g., data entry errors) or of valid data (e.g., genuinely huge salaries). We use a technique called clamping to deal with these values.
- Replace all values above the upper threshold (or below the lower threshold) with the threshold itself, capping the outliers.
- For a feature A with thresholds lower and upper: if a value of A is below lower, set it to lower; if it is above upper, set it to upper; otherwise leave it unchanged.
- Thresholds set based on domain-knowledge or data distribution
Basic workflow for addressing a dataset
- Describe the features within the dataset.
- Identify data quality issues within the dataset and the features.
- Correct data quality issues due to invalid data.
- Record data quality issues due to valid data in a data quality plan, along with potential handling strategies.
- Does enough good quality data exist to continue with a project?
- If so, once the dataset has been cleaned, it can be used for modeling.