Friday, 9 May 2014

2.4.1. Quantiles, Fractiles: Quartiles, Hinges, Quintiles, Octiles, Deciles and Percentiles

Quantiles (also known as fractiles) is a general term for the values that correspond to particular fractions of the total frequency (Kenney & Keeping, 1954). Some of these have been given special names:
  • Quartiles: Dividing the data into four sections
  • Quintiles: Dividing the data into five sections
  • Octiles: Dividing the data into eight sections
  • Deciles: Dividing the data into ten sections
  • Percentiles: Dividing the data into 100 sections
The most commonly used of these are probably the quartiles. These are the values at 25% intervals, each with its own name:
  • Q0 = 0% = zero quartile = minimum
  • Q1 = 25% = first quartile, lower quartile or lower median
  • Q2 = 50% = second quartile = median
  • Q3 = 75% = third quartile, upper quartile or upper median
  • Q4 = 100% = fourth quartile = maximum
Determining which values are the first and third quartile turns out to be trickier than you might first think. In total I managed to find 14 different ways to compute the quartiles, each giving a slightly different result depending on the dataset. I’ll discuss three methods to illustrate a few different approaches. For those really interested in the details of the other methods, I recommend reading the article Quartiles in Elementary Statistics by Langford (2006).
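To get a feel for how much the chosen method matters, here is a small sketch using Python's standard library. It assumes Python 3.8 or later, where statistics.quantiles offers two interpolation rules named 'inclusive' and 'exclusive'. These are the library's own rules and are not necessarily identical to the methods discussed in this section, even where the names look alike. The data set is made up for illustration.

```python
from statistics import quantiles

# Made-up data set of 9 values; the values equal their rank for easy reading.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four groups.
inclusive = quantiles(data, n=4, method="inclusive")
exclusive = quantiles(data, n=4, method="exclusive")

print("inclusive:", inclusive)  # inclusive: [3.0, 5.0, 7.0]
print("exclusive:", exclusive)  # exclusive: [2.5, 5.0, 7.5]
```

The median is the same in both cases, but the first and third quartile already differ on nine simple values.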

If the number of ordinal or scale items (n) is a multiple of 4, most methods agree. The first quartile will then lie between the (n/4)-th item and the (n/4 + 1)-th item, and the third quartile between the (3n/4)-th item and the (3n/4 + 1)-th item. If we had for example 8 items, the first quartile would be between the 2nd and 3rd item, and the third quartile between the 6th and 7th. This nicely divides the data into four groups, as illustrated in Figure 23.

Figure 23. General idea of quartiles

What if we have 9, 10 or 11 items? One approach is to re-arrange the items in a W or M shape. These are then known as hinges (as Tukey (1977) called them), or inclusive quartiles. Figure 24 below illustrates this approach.

Figure 24. Illustration of Tukey hinges

In the figure above you can see that with 8 items the 2.5th item (halfway between the 2nd and 3rd) is the lower quartile and the 6.5th item the upper quartile. With 9 items these are the 3rd and 7th item, with 10 items the 3rd and 8th, and with 11 items the 3.5th and 8.5th.
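The positions listed above can be reproduced with a short sketch. The function names are my own and purely illustrative. The code works with 1-based positions in the sorted data, where a position such as 2.5 means halfway between the 2nd and 3rd item.

```python
def median_position(first, last):
    """1-based position of the median of the items ranked first..last."""
    return (first + last) / 2


def hinge_positions(n):
    """1-based positions of the lower and upper hinge for n sorted items."""
    # Each half contains the median item itself when n is odd.
    half = (n + 1) // 2
    lower = median_position(1, half)
    upper = median_position(n - half + 1, n)
    return lower, upper


for n in (8, 9, 10, 11):
    print(n, hinge_positions(n))
# 8 (2.5, 6.5)
# 9 (3.0, 7.0)
# 10 (3.0, 8.0)
# 11 (3.5, 8.5)
```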

If you look closely you might notice that the hinges are the medians of the lower and upper half of the data. When n is odd, this means that the median itself is counted twice, once in each half. This explains why it is sometimes also called the inclusive method. Some authors disagree with this and suggest excluding the median when determining the quartiles, as illustrated in Figure 25.

 
Figure 25. Exclusive method illustration for quartiles

In the exclusive method, the median itself is removed from the data set, and the medians of the remaining lower and upper part then form the first and third quartile. When n is even there is no single middle item to remove, so the result is the same as with the hinges. If there are 8 items, the lower quartile will be the 2.5th item and the upper quartile the 6.5th. In case of 9 items it will be the 2.5th and 7.5th item, with 10 items the 3rd and 8th, and with 11 items the 3rd and 9th.
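A minimal sketch of the exclusive method, shown for an odd number of items where the median item is actually dropped. The function name is my own, and the data values are chosen equal to their rank so that the output can be read as "the 2.5th item" and so on. It assumes at least two items.

```python
from statistics import median


def exclusive_quartiles(values):
    """First and third quartile, leaving the median item out when n is odd."""
    data = sorted(values)
    n = len(data)
    half = n // 2  # floor division skips the middle item when n is odd
    lower_half = data[:half]
    upper_half = data[n - half:]
    return median(lower_half), median(upper_half)


print(exclusive_quartiles(range(1, 10)))  # 9 items  -> (2.5, 7.5)
print(exclusive_quartiles(range(1, 12)))  # 11 items -> (3, 9)
```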

Another approach focuses on the 25% idea behind a quartile. If we have n items, then 25% of them brings us to position n x 0.25. If this is an integer (i.e. n is divisible by four), the first quartile will be the average of the (n x 0.25)-th item and the next one, the (n x 0.25 + 1)-th. If it’s not an integer, we simply round up and take the item at that position. This approach is based on the cumulative, or empirical, distribution function and is the method eventually proposed by Langford (2006). The method is illustrated in Figure 26.

Figure 26. Illustration of the cumulative distribution function for quartiles

If there are 8 items, then for the lower quartile we obtain 8 x 0.25 = 2, which is an integer, so it will be between the 2nd and 3rd item. For the upper quartile we obtain 8 x 0.75 = 6, which is an integer, so it will be between the 6th and 7th item. If there are 9 items we get 9 x 0.25 = 2.25, which is rounded up to 3, so the lower quartile will be the 3rd item. For the upper quartile we get 9 x 0.75 = 6.75, rounded up to 7, so the 7th item. If there are 10 items the calculations are 10 x 0.25 = 2.5, rounded up to 3 (the 3rd item), and 10 x 0.75 = 7.5, rounded up to 8 (the 8th item). If there are 11 items we get 11 x 0.25 = 2.75, rounded up to 3 (the 3rd item), and 11 x 0.75 = 8.25, rounded up to 9 (the 9th item).
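The calculations above can be checked with a small sketch. The function name is my own. It uses whole-number arithmetic for the "is it an integer?" test, and again reports 1-based positions, where 2.5 means the average of the 2nd and 3rd item.

```python
import math


def ecdf_quartile_positions(n):
    """1-based positions of the first and third quartile for n sorted items."""
    positions = []
    for numerator in (1, 3):  # 1/4 and 3/4 of n
        if (numerator * n) % 4 == 0:
            # n x p is an integer: halfway between that item and the next one
            positions.append(numerator * n // 4 + 0.5)
        else:
            # otherwise round up and take that item
            positions.append(float(math.ceil(numerator * n / 4)))
    return tuple(positions)


for n in (8, 9, 10, 11):
    print(n, ecdf_quartile_positions(n))
# 8 (2.5, 6.5)
# 9 (3.0, 7.0)
# 10 (3.0, 8.0)
# 11 (3.0, 9.0)
```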

The empirical distribution method can easily be adapted to determine the quintiles (simply use 0.2, 0.4, 0.6 and 0.8 instead of 0.25 and 0.75), deciles (use multiples of 0.1), octiles (use multiples of 0.125) and percentiles (use multiples of 0.01).
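As a rough sketch of this generalisation, the function below returns all cut points for a given number of sections (4 for quartiles, 5 for quintiles, and so on). The function name and the example data are my own. Exact fractions are used instead of decimals such as 0.2, an implementation choice on my part to avoid rounding trouble when testing whether n x p is an integer.

```python
import math
from fractions import Fraction


def ecdf_cut_points(values, parts):
    """Cut points that divide the sorted values into `parts` sections."""
    data = sorted(values)
    n = len(data)
    cuts = []
    for j in range(1, parts):
        k = Fraction(j * n, parts)  # exact value of n x (j / parts)
        if k.denominator == 1:
            # integer position: average this item and the next one
            i = int(k)
            cuts.append((data[i - 1] + data[i]) / 2)
        else:
            # not an integer: round up and take that item
            cuts.append(data[math.ceil(k) - 1])
    return cuts


data = list(range(1, 11))        # 10 made-up values: 1, 2, ..., 10
print(ecdf_cut_points(data, 4))  # quartiles: [3, 5.5, 8]
print(ecdf_cut_points(data, 5))  # quintiles: [2.5, 4.5, 6.5, 8.5]
```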

Note that averaging two values is only possible if the values are numbers. With an ordinal data set this might not always be possible. One could then simply write something like “the lower quartile of the t-shirt sizes falls between small and medium”.
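For ordinal data, one possible way to handle this in code is to report the two neighbouring categories instead of averaging them. The sketch below does this for the lower quartile with the empirical distribution method. The function name, the size order and the data are all made up for illustration.

```python
import math

# Hypothetical ordering of the categories, lowest first.
SIZE_ORDER = ["small", "medium", "large"]


def ordinal_lower_quartile(labels, order):
    """Lower quartile of ordinal labels: one label, or a (low, high) pair."""
    data = sorted(labels, key=order.index)
    n = len(data)
    if n % 4 == 0:
        k = n // 4
        low, high = data[k - 1], data[k]
        # No averaging possible: report both labels if they differ.
        return low if low == high else (low, high)
    return data[math.ceil(n / 4) - 1]


shirts = ["large", "small", "medium", "large", "medium", "small", "medium", "large"]
print(ordinal_lower_quartile(shirts, SIZE_ORDER))  # ('small', 'medium')
```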

History
The first known use of the terms upper quartile and lower quartile comes from McAlister (1879). The term hinges was introduced by Tukey (1977). The general term quantiles was used by Kendall (1940). Galton (1885) uses the term percentile, describing the median as the 50th percentile, and had used the term decile earlier (1881) as well. Fisher, Thornton, & Mackenzie (1922) introduced the term quintiles.

>>Next section: Outliers, unusual scores and z-scores
 
References
Fisher, R. A., Thornton, H. G., & Mackenzie, W. A. (1922). The Accuracy of the Plating Method of Estimating the Density of Bacterial Populations. Annals of Applied Biology, 9, 325–359.
Galton, F. (1881). Report of the Anthropometric Committee. Report of the British Association for the Advancement of Science, 51, 225–272.
Galton, F. (1885). Some results of the Anthropometric Laboratory. Journal of the Anthropological Institute, 14, 275–287.
Kendall, M. G. (1940). Note on the Distribution of Quantiles for Large Samples. Supplement to the Journal of the Royal Statistical Society, 7(1), 83–85.
Kenney, J. F., & Keeping, E. S. (1954). Mathematics of Statistics; Part one (3rd ed.). New York: D. Van Nostrand Company.
Langford, E. (2006). Quartiles in Elementary Statistics. Journal of Statistics Education, 14(3). Retrieved from http://www.amstat.org/publications/jse/v14n3/langford.html
McAlister, D. (1879). The Law of the Geometric Mean. Proceedings of the Royal Society of London, 29(196-199), 367–376. doi:10.1098/rspl.1879.0061
Tukey, J. W. (1977). Exploratory Data Analysis. Pearson.
