Sometimes datasets contain missing data. Imagine a dataset with a column "age", where a lot of the ages are missing. This could pose a problem because it is impossible to perform statistical analysis when data is missing. Missing data can appear in any form including blanks and NaN. It is important to correctly identify the nature of missing data in your datasets. Values such as zeroes and unknown could easily be confused for missing data when for the given use case or problem statement they are not. Always know your missing data.
Types of Missing Data:
In addition to knowing what your missing data looks like, it is important to understand why the data is missing in the first place and the type or nature of the missing data. There are three reasons why data could be missing.
- Missing Completely at Random (MCAR)
- Missing At Random (MAR)
- Missing not at Random (MNAR)
Missing Completely at Random (MCAR):
When data is Missing Completely at Random then its missingness is entirely random, and is unrelated to or independent of any variables in the dataset. This missingness can also be considered to be unrelated to the subject of the missing data. One example would be in a health study where an insulin sample was lost by the researcher. When data is MCAR then analysis performed on the data is unbiased and still representative of the population.
Missing at Random (MAR):
When data is Missing at Random then its missingness is not related to the missing data itself but is related to some of the observed data and could be predicted from data already collected. This might occur in a situation where a researcher is unable to enter a weight for week 6 of a pregnant woman because that woman fell ill. Depending on the type of analysis required MAR could introduce bias.
Missing Not at Random (MNAR):
When data is Missing Not at Random (MNAR) then its missingness is infact related to the value of the variable or the value of other variables or both. One example would be deliberately leaving out income on a form because you don't want anyone knowing your salary. This type of missingness can definitely result in bias during analysis especially if persons within a certain income bracket are more prone to not enter their salary.
Ask for Better Data:
One option for handling a high percentage of missing data would be to ask for new data from the client.
Listwise or Complete Case Deletion:
One harsh and drastic way of handling missing data for a given column or feature would be to remove the entire row or case containing the missing feature, This would result in a smaller dataset with less rows of data. Deleting an entire row could be unfair to other columns in the dataset that actually contain data for the deleted row. The lost data could potentially add value to future analysis. Thus, this type of deletion could introduce bias when analysis is conducted.
One other option for handling missing data in a dataset would be to completely remove or delete the column(s) containing the missing data. This would depend on the business problem, use case, the value of the data stored in the column(s) and the percentage of data missing from those columns. If a column provides a significant amount of value or information for analysis, there is a strong case for keeping that feature even when a high percentage of data is missing. If on the other hand, the feature doesn't provide a lot of value and a high percentage of data is missing, then the column should be deleted.
Imputation is a strategy for replacing missing data with an estimated value.
Univariate Feature Imputation:- Central Tendency
This type of imputation uses the remaining values of the column containing missing data, to compute a substitute value for the missing data. Some examples are mean substitution, median substitution and mode (most frequent) substitution which takes into consideration the Central Tendency of the remaining data.
Univariate Feature Imputation:- Chosen Constant
Another example of Univariate Feature Imputation would be to replace the missing values with a chosen constant that probably came from researching a similar use case or business problem.
Multivariate Feature Imputation: - Regression Imputation
Imagine a dataset with a column y that contains missing data. Imagine building a regression model of y as a function of the other features (X) in the dataset. Imagine using this regression model (X,y) to then predict the missing values of y. Keep in mind that the columns or features used to build the regression model should NOT contain missing data.
Handling missing data is an important and necessary step in Exploratory Data Analysis. Adequate time should be spent on identifying the type, nature and reason for data missingness. In addition, depending on the use case and problem statement, Data Scientists should pick the best method for handling the missing data from the options of deletion and imputation.