Thursday, 1 May 2014

Appendix B: Number of classes

How the distribution of the data looks depends on how the classes are chosen, which is why the number of classes is important.

When binning numerical data into classes, the question of how many classes (k) to use comes up, or similarly, what the class width (h) should be. There are quite a lot of different methods to determine this and, to my knowledge, no single agreed-upon method exists. The topic is perhaps surprisingly complex and diverse. Some authors simply avoid it by advising to "do what you think is best", or only give a rule of thumb. An interesting overview is available in the dissertation of Lohaka (2007). SPSS uses quite sophisticated procedures from Fayyad & Irani (1993), Dougherty, Kohavi, & Sahami (1995), and Liu, Hussain, Tan, & Dash (2002).

The symbols used in the formulas below are:
  • k
    The number of classes
  • h
    The class width
  • R
    The range.
  • n or N
    The number of data points.
  • s
    The unbiased sample standard deviation.
  • IQR
    The interquartile range.
  • ⌈...⌉
    The so-called ceiling function. It simply means to round up to the nearest integer (whole number), e.g. ⌈3.01⌉ = 4.
  • ⌊...⌋
    The floor function. It simply means to round down to the nearest integer (whole number), e.g. ⌊3.99⌋ = 3.

  • √(...)
    The square root. It in essence asks which number, multiplied by itself, will produce the given number, e.g. √(16) = 4, because 4 * 4 = 16.

Pre-determined class width (h)
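Presumably, once a class width h has been chosen in advance, the number of classes follows from the range as k = ⌈R / h⌉. A minimal sketch in Python, assuming that reading (the function name and example data are illustrative):

import math

def classes_from_width(data, h):
    # Assumed reading: with a pre-determined class width h,
    # the number of classes is the range divided by h, rounded up.
    R = max(data) - min(data)
    return math.ceil(R / h)

print(classes_from_width([2, 3, 5, 8, 13, 21], 4))  # range 19, width 4 -> 5 classes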


Square root choice
This method is the default in MS Excel; an early mention of it can be found in Davies & Goldsmith (1980, p. 11), as cited in Lohaka (2007).
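A minimal sketch, assuming the usual form of this rule, k = ⌈√n⌉:

import math

def sqrt_choice(n):
    # Square root choice: the square root of the number of
    # data points, rounded up to the nearest integer.
    return math.ceil(math.sqrt(n))

print(sqrt_choice(50))  # sqrt(50) ≈ 7.07 -> 8 classes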


Sturges' Rule
The formula from Sturges (1926, p. 65) is often used. He himself wrote 3.322 in the formula, but from the text it is clear that he actually used log₂.
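A small sketch of both readings (the published 3.322 · log₁₀ form and the equivalent log₂ form), rounding the result up:

import math

def sturges(n):
    # Sturges' rule with the base-2 logarithm: k = 1 + log2(n), rounded up.
    return math.ceil(1 + math.log2(n))

def sturges_published(n):
    # The form as printed by Sturges: k = 1 + 3.322 * log10(n), rounded up.
    return math.ceil(1 + 3.322 * math.log10(n))

print(sturges(50), sturges_published(50))  # both give 7 for n = 50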




Rice's Rule
Rice (as cited by Lane, n.d.) suggests using twice the cube root of the sample size.
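As a sketch (rounding up to a whole number of classes, which is an assumption on my part):

import math

def rice(n):
    # Rice's rule: twice the cube root of the sample size, rounded up.
    return math.ceil(2 * n ** (1 / 3))

print(rice(50))  # 2 * 50^(1/3) ≈ 7.37 -> 8 classes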


Cencov's Rule
According to Cencov (1962), the number of bins should be equal to the cube root of the number of data points.
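In code (again rounding up to a whole number of classes, an assumption on my part):

import math

def cencov(n):
    # Cencov's rule: the cube root of the number of data points, rounded up.
    return math.ceil(n ** (1 / 3))

print(cencov(50))  # 50^(1/3) ≈ 3.68 -> 4 classes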


Terrell & Scott Rule
Terrell & Scott (1985) note that Sturges' rule can only be used for finite ranges; they suggest using:
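A sketch, assuming the commonly cited form of their rule, k = ⌈(2n)^(1/3)⌉:

import math

def terrell_scott(n):
    # Terrell & Scott: the cube root of twice the number of
    # data points, rounded up (assumed form of their rule).
    return math.ceil((2 * n) ** (1 / 3))

print(terrell_scott(50))  # (2 * 50)^(1/3) ≈ 4.64 -> 5 classes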


Doane's Formula
Doane (1976) builds on Sturges' rule, but adjusts it so that his formula can also be used for skewed data. In his formula the skewness statistic plays a vital role.
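A sketch, assuming the commonly cited form of Doane's formula, k = 1 + log₂(n) + log₂(1 + |g₁| / σ(g₁)), where g₁ is the sample skewness and σ(g₁) = √(6(n − 2) / ((n + 1)(n + 3))):

import math

def doane(data):
    # Doane's formula (assumed form):
    # k = 1 + log2(n) + log2(1 + |g1| / sigma_g1), rounded up,
    # where g1 is the moment coefficient of skewness and
    # sigma_g1 = sqrt(6(n - 2) / ((n + 1)(n + 3))).
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    g1 = m3 / m2 ** 1.5
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1))

print(doane([1, 2, 2, 3, 3, 3, 4, 4, 5, 20]))  # right-skewed example data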


Scott's normal reference rule
Scott (1979) originally proposed using 3.49 in his formula, but later (1992) simplified this to:
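A sketch of the rule as a class width, assuming the simplified 1992 version rounds the constant to 3.5 (the example data are illustrative):

import math
import statistics

def scott_width(data, constant=3.5):
    # Scott's normal reference rule: h = constant * s / n^(1/3),
    # with 3.49 in the original 1979 paper and (assumed here) 3.5
    # in the simplified 1992 version.
    s = statistics.stdev(data)      # unbiased sample standard deviation
    return constant * s / len(data) ** (1 / 3)

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
h = scott_width(data)
print(h, math.ceil((max(data) - min(data)) / h))  # width and implied number of classes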


Freedman-Diaconis' choice
Freedman and Diaconis (1981) use the IQR, which is less sensitive to outliers than the standard deviation.
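A sketch, assuming the usual form h = 2 · IQR / n^(1/3); note that the quartile method used here is Python's default (exclusive), and other quartile definitions will give slightly different widths:

import math
import statistics

def freedman_diaconis_width(data):
    # Freedman-Diaconis: h = 2 * IQR / n^(1/3).
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles (Python 3.8+)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
h = freedman_diaconis_width(data)
print(h, math.ceil((max(data) - min(data)) / h))  # width and implied number of classes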


Shimazaki and Shinomoto
Shimazaki and Shinomoto (2007) take a different approach. They propose the following procedure:
  • Divide the data range into k bins of width C and determine the frequency for each bin.
  • Calculate the average frequency (m) and the biased sample variance (s) of those frequencies.
  • Calculate f(m, s) = (2m − s) / C².
  • Repeat steps 1-3 for different bin widths and keep the one with the lowest f(m, s); a code sketch follows below.
On http://toyoizumilab.brain.riken.jp/hideaki/res/histogram.html a Java applet is provided to help with the above.
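A small sketch of that search in Python (the candidate range of 2 to 50 bins and the example data are arbitrary choices of mine):

def shimazaki_shinomoto(data, max_k=50):
    # For each candidate number of bins k: compute the bin width C,
    # the bin frequencies, their mean m and biased variance s, and the
    # cost f = (2 * m - s) / C**2; keep the k with the lowest cost.
    lo, hi = min(data), max(data)
    best = None
    for k in range(2, max_k + 1):
        C = (hi - lo) / k
        counts = [0] * k
        for x in data:
            i = min(int((x - lo) / C), k - 1)   # place the maximum in the last bin
            counts[i] += 1
        m = sum(counts) / k
        s = sum((c - m) ** 2 for c in counts) / k   # biased sample variance
        f = (2 * m - s) / C ** 2
        if best is None or f < best[0]:
            best = (f, k, C)
    return best  # (lowest cost, number of bins, bin width)

print(shimazaki_shinomoto([1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10]))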

References
Cencov, N. N. (1962). Evaluation of an unknown distribution density from observations. Soviet Mathematics, 3, 1559–1562.
Doane, D. P. (1976). Aesthetic Frequency Classifications. The American Statistician, 30(4), 181–183. doi:10.2307/2683757
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and Unsupervised Discretization of Continuous Features. In A. Prieditis & S. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 194–202). San Francisco, CA: Morgan Kaufmann. Retrieved from http://citeseer.ist.psu.edu/109288.html
Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning (pp. 1022–1029). Presented at the International Joint Conference on Artificial Intelligence, Chambéry: Morgan Kaufmann. Retrieved from http://dblp.uni-trier.de/db/conf/ijcai/ijcai93.html
Freedman, D., & Diaconis, P. (1981). On the histogram as a density estimator. Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete, 57(4), 453–476. doi:10.1007/BF01025868
Lane, D. M. (n.d.). Online Statistics Education: An Interactive Multimedia Course of Study. Rice University. Retrieved from http://onlinestatbook.com/
Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6(4), 393–423. doi:10.1023/A:1016304305535
Lohaka, H. O. (2007, November). Making a Grouped-Data Frequency Table: Development and Examination of the Iteration Algorithm (Doctoral dissertation). Ohio University, Ohio. Retrieved from https://etd.ohiolink.edu
Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66(3), 605–610. doi:10.1093/biomet/66.3.605
Scott, D. W. (1992). Histograms: Theory and Practice. In Multivariate density estimation: theory, practice, and visualization. New York: Wiley.
Shimazaki, H., & Shinomoto, S. (2007). A Method for Selecting the Bin Size of a Time Histogram. Neural Computation, 19(6), 1503–1527. doi:10.1162/neco.2007.19.6.1503
Sturges, H. A. (1926). The Choice of a Class Interval. Journal of the American Statistical Association, 21(153), 65–66. doi:10.1080/01621459.1926.10502161
Terrell, G. R., & Scott, D. W. (1985). Oversmoothed Nonparametric Density Estimates. Journal of the American Statistical Association, 80(389), 209–214. doi:10.2307/2288074
