Finding patterns in real data is difficult because the normality condition does not hold for most measured dimensions. Most existing high-dimensional outlier detection algorithms
rely heavily on the assumption that the data set is normally distributed. This research addresses the problem of outlier detection in high-dimensional space for
data sets that are distributed non-uniformly about the mean.
The work focuses mainly on the development of a novel algorithm, Voting Outliers based on Randomised Sampling (VOTERS), as a solution to the high-dimensional outlier detection problem. The algorithm
was developed using existing statistical and machine learning principles and was later implemented in Java.
VOTERS is based on modifying the location and scatter estimates in the Mahalanobis distance, followed by voting with the Mahalanobis chi-square measure at a significance level adapted to the heterogeneity of the data set.
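For reference, the standard Mahalanobis chi-square rule can be written as below. This is the textbook form only; the symbols (location estimate \hat{\mu}, scatter estimate \hat{\Sigma}, dimension p, significance level \alpha) are notation assumed here for illustration, and the modified estimates and adapted level used by VOTERS are not reproduced.

```latex
% Squared Mahalanobis distance of observation x_i from the estimated centre
d^2(x_i) = (x_i - \hat{\mu})^{\top} \hat{\Sigma}^{-1} (x_i - \hat{\mu})

% For p-variate normal data with known parameters, d^2 follows a chi-square
% distribution with p degrees of freedom, so x_i is flagged when
d^2(x_i) > \chi^2_{p,\,1-\alpha}
```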
The mathematical foundation of the work draws on dimensionality reduction using principal component analysis, bootstrapping techniques that randomise sampling within the cluster to generate votes for each data point,
and statistical confidence analysis of the chi-square Mahalanobis plot based on the heterogeneity level of the data set.
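The general idea of bootstrap voting with a Mahalanobis chi-square cutoff can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the VOTERS implementation: it uses the plain sample mean and covariance rather than modified estimates, a fixed significance level rather than an adapted one, and two-dimensional data so that no principal component step or matrix library is needed. The class name VotingSketch and all method names are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch: bootstrap voting with a Mahalanobis chi-square cutoff (2-D only).
class VotingSketch {

    // Chi-square quantile for 2 degrees of freedom; closed form -2 ln(1 - p).
    public static double chiSquareQuantile2(double p) {
        return -2.0 * Math.log(1.0 - p);
    }

    // Squared Mahalanobis distance of x from (mean, cov), using the 2x2 inverse.
    public static double mahalanobisSquared(double[] x, double[] mean, double[][] cov) {
        double det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
        double dx = x[0] - mean[0];
        double dy = x[1] - mean[1];
        return (cov[1][1] * dx * dx
                - (cov[0][1] + cov[1][0]) * dx * dy
                + cov[0][0] * dy * dy) / det;
    }

    // Each round draws a bootstrap resample, estimates mean and covariance from it,
    // and gives one vote to every original point whose distance exceeds the cutoff.
    public static int[] countVotes(double[][] data, int rounds, double level, long seed) {
        int n = data.length;
        Random rng = new Random(seed);
        double cutoff = chiSquareQuantile2(level);
        int[] votes = new int[n];
        for (int r = 0; r < rounds; r++) {
            double[][] sample = new double[n][];
            double[] mean = new double[2];
            for (int i = 0; i < n; i++) {
                sample[i] = data[rng.nextInt(n)]; // sampling with replacement
                mean[0] += sample[i][0] / n;
                mean[1] += sample[i][1] / n;
            }
            double[][] cov = new double[2][2];
            for (double[] s : sample) {
                double dx = s[0] - mean[0];
                double dy = s[1] - mean[1];
                cov[0][0] += dx * dx / (n - 1);
                cov[0][1] += dx * dy / (n - 1);
                cov[1][1] += dy * dy / (n - 1);
            }
            cov[1][0] = cov[0][1];
            double det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
            if (det <= 1e-12) {
                continue; // skip a degenerate resample
            }
            for (int i = 0; i < n; i++) {
                if (mahalanobisSquared(data[i], mean, cov) > cutoff) {
                    votes[i]++;
                }
            }
        }
        return votes;
    }

    // 50 standard-normal points plus one artificially inserted outlier at index 50.
    public static double[][] demoData() {
        Random rng = new Random(1L);
        double[][] data = new double[51][];
        for (int i = 0; i < 50; i++) {
            data[i] = new double[] {rng.nextGaussian(), rng.nextGaussian()};
        }
        data[50] = new double[] {8.0, -8.0};
        return data;
    }

    public static void main(String[] args) {
        int[] votes = countVotes(demoData(), 200, 0.975, 42L);
        System.out.println("Votes for the inserted outlier: " + votes[50] + " of 200");
    }
}
```

Points that collect votes in most rounds are the probable outliers. Re-estimating the centre and scatter on each resample means that a single extreme point is absent from roughly a third of the resamples, which limits how much it can mask itself by inflating the covariance.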
Outlier detection results were validated by comparison with artificially inserted known outliers, and the algorithm provided a list of probable outliers. The performance of the proposed algorithm was found to be significantly better than that of other
existing multivariate outlier detection algorithms. The algorithm was further improved by introducing an adaptive VOTERS characteristic based on sample size and uniformity, using parameters learned from tens of thousands
of simulations.