VOTERS

Voting Outliers Based on Randomised Sampling

Abstract

Finding patterns in real data is difficult because the normality assumption does not hold for most measured dimensions. Most existing high-dimensional outlier detection algorithms rely heavily on the assumption that the data set is normally distributed. This research addresses outlier detection in high-dimensional space for data sets that are non-uniformly distributed about the mean. The work focuses on the development of a novel algorithm, Voting Outliers based on Randomised Sampling (VOTERS), as a solution to the high-dimensional outlier detection problem. The algorithm was developed using existing statistical and machine learning principles and later implemented in Java. VOTERS modifies the location and scatter estimates used in the Mahalanobis distance and then casts votes using a Mahalanobis chi-square measure at a significance level adapted to the heterogeneity of the data set. The mathematical foundation of the work draws on dimensionality reduction using principal component analysis, bootstrapping techniques that randomise sampling within a cluster to generate votes for each data point, and statistical confidence analysis of the chi-square Mahalanobis plot based on the heterogeneity level of the data set. Outlier detection results were validated against artificially inserted, known outliers, and the algorithm provided a list of probable outliers. The performance of the proposed algorithm was found to be significantly better than that of other existing multivariate outlier detection algorithms. The algorithm was further improved by introducing an adaptive VOTERS characteristic based on sample size and uniformity, using parameters learned from tens of thousands of simulations.
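The voting idea described above can be illustrated with a minimal sketch. It is not the authors' Java implementation, and it leaves out the PCA step, the modified location and scatter estimates, and the adaptive significance level. It assumes two-dimensional data, for which the chi-square quantile with 2 degrees of freedom has the closed form -2 ln(alpha). The function names, the fixed alpha, the number of rounds, and the majority-vote threshold are all hypothetical choices made for illustration, and Python is used here only for brevity.

```python
import math
import random


def mean_and_cov(sample):
    """Return the sample mean and the unbiased 2x2 covariance (sxx, sxy, syy)."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance in 2D, using the explicit 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def vote_outliers(points, rounds=200, alpha=0.01, vote_fraction=0.5, seed=0):
    """Flag indices of points that exceed the chi-square cutoff in most rounds.

    Assumption: alpha is fixed here, whereas the abstract describes a
    significance level adapted to the heterogeneity of the data set.
    """
    rng = random.Random(seed)
    cutoff = -2.0 * math.log(alpha)  # chi-square quantile, 2 degrees of freedom
    votes = [0] * len(points)
    used = 0
    for _ in range(rounds):
        # Bootstrap: resample the data with replacement.
        sample = [rng.choice(points) for _ in range(len(points))]
        mean, cov = mean_and_cov(sample)
        if cov[0] * cov[2] - cov[1] ** 2 <= 1e-12:
            continue  # skip a degenerate resample with singular covariance
        used += 1
        for i, p in enumerate(points):
            if mahalanobis_sq(p, mean, cov) > cutoff:
                votes[i] += 1
    return [i for i, v in enumerate(votes) if used and v > vote_fraction * used]


# Demo: 200 standard-normal inliers plus three artificially inserted outliers.
demo_rng = random.Random(42)
points = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(200)]
planted = [(8.0, 8.0), (-9.0, 7.0), (10.0, -10.0)]
points += planted
flagged = vote_outliers(points)
print(flagged)
```

Because each bootstrap round re-estimates the mean and covariance from a different resample, a point's vote count reflects how consistently it lies outside the chi-square contour, rather than its distance under a single estimate. The planted points occupy indices 200 to 202, mirroring the validation against artificially inserted outliers mentioned above.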