Terer Mercy Chepkirui

Student Bio

Mercy Chepkirui Terer is a data scientist with extensive experience in software engineering and data management. Ms Terer earned a Bachelor of Science degree in computer science from the University of Eldoret (2010-2014) and holds a master’s degree in biostatistics from the University of Nairobi (2018-2020). She currently works at the KEMRI-Wellcome Trust Research Programme, where she has applied her technical, analytical, and research skills in several projects that improved the effectiveness of data usage and the quality of survey results. Her background in computer science has strongly shaped her computational and algorithmic problem-solving skills.

Project Summary

Project Title: A Comparative Analysis of Unsupervised Outlier Detection Methods for Data Quality Assurance

Project Abstract

Data quality assurance is a key component of research, because good-quality data leads to effective and unbiased reporting. Errors in data are inevitable, and routinely checking large datasets for them is almost impossible without automated mechanisms. Common error-checking mechanisms such as range checks, quantile ranges, and z-scores are limited to continuous data types and to datasets with a small feature space. Errors in dichotomous and character data types are easily missed, hence the need for methods that scan all data types in large datasets for anomalies. Two-Pass Verification (TPV), on the other hand, is a gold-standard method for assessing data quality. However, it is a tedious manual process that relies on random sampling for larger datasets. We propose machine learning outlier detection algorithms as possible alternative methods for error checking. Instead of randomly selecting a set of observations, the outlying observations are cross-referenced for possible errors. We evaluated two unsupervised machine learning algorithms, k-means clustering and isolation forest, for outlier detection. We then compared TPV, k-means, and isolation forest anomaly scores. The normalized mutual information score and the coefficient of determination were used to measure the strength of the correlation. The results indicated that unsupervised machine learning methods are possible alternatives for data quality assurance, with isolation forest performing better than k-means clustering.
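
The scoring and comparison steps described in the abstract can be illustrated with a minimal sketch. It assumes Python with NumPy and scikit-learn; the abstract does not name the project's tooling. The synthetic data, the helper names (kmeans_anomaly_scores, isolation_forest_anomaly_scores, top_k), the number of clusters, and the choice to flag the ten highest-scoring rows are all hypothetical and are not taken from the project. The coefficient of determination is computed here as the squared Pearson correlation between the two score vectors, which is one possible reading of the abstract. TPV is a manual process and is not simulated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler


def kmeans_anomaly_scores(X, n_clusters=2, random_state=0):
    """Distance from each row to its nearest k-means centroid (larger = more outlying)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # transform() gives the distance to every centroid; keep the smallest per row.
    return km.transform(X).min(axis=1)


def isolation_forest_anomaly_scores(X, random_state=0):
    """Isolation forest score, negated so that larger = more outlying."""
    forest = IsolationForest(n_estimators=200, random_state=random_state).fit(X)
    # score_samples() is lower for abnormal rows, so flip the sign.
    return -forest.score_samples(X)


def top_k(scores, k):
    """Indices of the k rows with the highest anomaly scores."""
    return np.argsort(scores)[::-1][:k]


# Hypothetical data: 200 plausible records plus 3 injected data-entry errors
# (an extra digit, an extra digit, and dropped digits).
rng = np.random.default_rng(42)
clean = rng.normal(loc=[50.0, 120.0], scale=[5.0, 10.0], size=(200, 2))
errors = np.array([[500.0, 120.0], [50.0, 1200.0], [5.0, 12.0]])
X = StandardScaler().fit_transform(np.vstack([clean, errors]))

km_scores = kmeans_anomaly_scores(X)
if_scores = isolation_forest_anomaly_scores(X)

# Rows to send for cross-referencing, instead of a random sample.
k = 10
km_flags = np.zeros(len(X), dtype=int)
km_flags[top_k(km_scores, k)] = 1
if_flags = np.zeros(len(X), dtype=int)
if_flags[top_k(if_scores, k)] = 1

# Agreement between the two methods: NMI on the binary flags,
# squared Pearson correlation on the raw scores.
nmi = normalized_mutual_info_score(km_flags, if_flags)
r2 = np.corrcoef(km_scores, if_scores)[0, 1] ** 2

print("Rows flagged by isolation forest:", sorted(top_k(if_scores, k).tolist()))
print("Rows flagged by k-means:", sorted(top_k(km_scores, k).tolist()))
print("NMI between flag sets:", round(nmi, 3))
print("R^2 between score vectors:", round(r2, 3))
```

In this sketch the injected errors occupy rows 200 to 202, so the printed lists show whether each method ranks them among its top candidates. A distance-to-centroid score depends on where the centroids land, so the k-means result may change with the number of clusters.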