Statistics: 2.4.2. Outliers, Unusual Scores and Z-scores

If you have entered the responses from 50 people on 10 questions, you have entered 50 × 10 = 500 observations. It is not unlikely that you made a mistake somewhere. If the mistake is small (e.g. you entered 23 as the age instead of 25 for one person), it will not have a huge impact, but if you entered a zero too many (e.g. 230 as the age instead of 23), it might. By checking for outliers or unusual scores you might detect such an error.

Sometimes an outlier is very interesting. How come this person is so old but still plays more video games than a teenager? Although outliers are often simply removed from the data, this should only be done with great care. Outliers are, as the name implies, not very common, but they are part of the data and the population.

A simple method of detecting outliers is to generate a graph of your data and look for any unusual observations. This method might not be very scientific, but it is highly recommended. There are of course also other methods.

The APA Dictionary of Statistics and Research Methods simply defines an outlier as “an extreme observation or measurement, that is, a score that significantly differs from all others obtained” (Zedeck, 2014, p. 248). This leaves open when exactly something counts as ‘significantly different’.

One method sometimes used is to define an outlier as any data point that is more than a fixed number of standard deviations away from the mean. What this fixed number should be varies per source. Aggarwal (2013) mentions 3, Miller (1991) suggests 2.5 or even 2, and Cousineau & Chartier (2010) suggest letting it depend on the sample size. The number of standard deviations a value lies above or below the mean is known as the z-score of that value. The z-score is obtained by subtracting the mean from the value (which gives the difference from the mean) and then dividing by the standard deviation (which gives how many standard deviations fit into this difference).
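The z-score rule can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names `z_scores` and `flag_by_z` and the example ages are made up for this post, the sample standard deviation (dividing by n − 1) is used, and the cut-off of 2.5 is just one of the values mentioned above.

```python
import statistics


def z_scores(values):
    """Return the z-score of every value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


def flag_by_z(values, cutoff=2.5):
    """Return the values whose absolute z-score is larger than the cut-off."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]


# Hypothetical ages, with 230 entered instead of 23 for the last person.
ages = [23, 25, 31, 19, 27, 22, 24, 28, 26, 21, 230]
print(flag_by_z(ages, cutoff=2.5))
```

Whether a value gets flagged depends on the chosen cut-off, so it is worth trying more than one.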

A problem with this method is that the mean and standard deviation themselves are affected by outliers. Another approach is therefore to use the median instead of the mean, and either the median absolute deviation (Iglewicz & Hoaglin, 1993; Leys, Ley, Klein, Bernard, & Licata, 2013) or the interquartile range (Carling, 2000) instead of the standard deviation.
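As a rough sketch of the median-based approach, the code below computes a so-called modified z-score from the median and the median absolute deviation (MAD). The constant 0.6745 and the cut-off of 3.5 are the values commonly attributed to Iglewicz & Hoaglin (1993); other sources scale the MAD differently and use other cut-offs, so treat both numbers as assumptions. The function names and the example ages are made up for illustration.

```python
import statistics


def modified_z_scores(values):
    """Return 0.6745 * (value - median) / MAD for every value."""
    med = statistics.median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag_by_modified_z(values, cutoff=3.5):
    """Return the values whose absolute modified z-score exceeds the cut-off."""
    scores = modified_z_scores(values)
    return [v for v, m in zip(values, scores) if abs(m) > cutoff]


# Hypothetical ages, with 230 entered instead of 23 for the last person.
ages = [23, 25, 31, 19, 27, 22, 24, 28, 26, 21, 230]
print(flag_by_modified_z(ages))
```

Because the median and the MAD hardly move when a single extreme value is added, the value 230 stands out much more clearly here than it does with the ordinary z-score.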

A third method is to consider every value that lies more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile a potential outlier, and every value that lies more than three times the IQR above the third quartile or below the first quartile an extreme potential outlier (Tukey, 1977).
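The IQR rule could be sketched as follows. Note that there are several ways to calculate quartiles, and they can give slightly different results; this sketch simply uses the default ('exclusive') method of Python's `statistics.quantiles`, which is an assumption and not necessarily the method Tukey used. The function name `tukey_outliers` and the example data are made up for illustration.

```python
import statistics


def tukey_outliers(values):
    """Return (potential, extreme) outliers based on the IQR rule.

    potential: more than 1.5 * IQR below Q1 or above Q3
    extreme:   more than 3 * IQR below Q1 or above Q3 (a subset of potential)
    """
    # Assumption: quartiles from the default 'exclusive' method.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    potential = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    extreme = [v for v in values if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    return potential, extreme


# Hypothetical ages, with 230 entered instead of 23 for the last person.
ages = [23, 25, 31, 19, 27, 22, 24, 28, 26, 21, 230]
potential, extreme = tukey_outliers(ages)
print("potential outliers:", potential)
print("extreme potential outliers:", extreme)
```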

Besides the three methods described above there are also more advanced techniques, such as Grubbs’ test (Grubbs, 1950), Dixon’s Q test (Dixon, 1950) and the generalized Extreme Studentized Deviate test (Rosner, 1983). Entire books have been written about detecting outliers (Aggarwal, 2013; Barnett & Lewis, 1994; Rousseeuw & Leroy, 2003).

Two things should be clear, though:

If you have one or more outliers, it does not automatically mean these answers were wrong.
If you do not have any outliers, it does not automatically mean you haven’t made any mistakes in your data entry.

History
Pearson (1903) mentions the term outlier, but the study of unusual scores is a lot older. Chauvenet (1863) already discusses a “criterion for the rejection of doubtful observations” (p. 558).

References
Aggarwal, C. C. (2013). Outlier analysis. New York: Springer.
Barnett, V., & Lewis, T. (1994). Outliers in Statistical Data (3rd ed.). Chichester; New York: Wiley.
Carling, K. (2000). Resistant outlier rules and the non-Gaussian case. Computational Statistics & Data Analysis, 33(3), 249–258. doi:10.1016/S0167-9473(99)00057-2
Chauvenet, W. (1863). Appendix. Method of Least Squares. In A Manual of Spherical and Practical Astronomy: Theory and use of astronomical instruments, method of least squares (1st ed., Vol. 2, pp. 469–566). Philadelphia: J. B. Lippincott & Company.
Cousineau, D., & Chartier, S. (2010). Outliers detection and treatment: a review. International Journal of Psychological Research, 3(1), 58–67.
Dixon, W. J. (1950). Analysis of Extreme Values. The Annals of Mathematical Statistics, 21(4), 488–506. doi:10.1214/aoms/1177729747
Grubbs, F. E. (1950). Sample Criteria for Testing Outlying Observations. The Annals of Mathematical Statistics, 21(1), 27–58. doi:10.1214/aoms/1177729885
Iglewicz, B., & Hoaglin, D. C. (1993). How to detect and handle outliers. Milwaukee, Wis: ASQC Quality Press.
Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764–766. doi:10.1016/j.jesp.2013.03.013
Miller, J. (1991). Short report: Reaction time analysis with outlier exclusion: Bias varies with sample size. The Quarterly Journal of Experimental Psychology Section A, 43(4), 907–912. doi:10.1080/14640749108400962
Pearson, K. (1903). On the Laws of Inheritance in Man: I. Inheritance of Physical Characters. Biometrika, 2(4), 357–462. doi:10.2307/2331507
Rosner, B. (1983). Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics, 25(2), 165. doi:10.2307/1268549
Rousseeuw, P. J., & Leroy, A. M. (2003). Robust Regression and Outlier Detection (1st ed.). Hoboken, NJ: Wiley-Interscience.
Tukey, J. W. (1977). Exploratory Data Analysis. Pearson.
Zedeck, S. (Ed.). (2014). APA dictionary of statistics and research methods (1st ed.). Washington, DC: American Psychological Association.

Thursday, 8 May 2014
