Saturday 31 May 2014

2.1.1. Types of tables


The Oxford English dictionary defines a table as "a set of facts or figures systematically displayed, especially in columns".

I distinguish three main types of tables (although variations on these exist):
  • Frequency tables
  • Grouped frequency tables
  • Contingency tables
A frequency table simply lists the possible values and the corresponding frequencies (counts). Table 2 shows an example of a simple frequency table.

Table 2
Example of a Frequency Table

Gender    Frequency
Female    28
Male      21
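
As a minimal sketch of how such a table can be built from raw data, the Python snippet below counts values with the standard library's Counter. The list of raw answers is hypothetical and only chosen so that the counts match Table 2.

```python
from collections import Counter

# Hypothetical raw data, chosen so that the counts match Table 2.
genders = ["Female"] * 28 + ["Male"] * 21

# Counter lists each possible value with its frequency (count).
frequency_table = Counter(genders)

for value, count in sorted(frequency_table.items()):
    print(f"{value:<10}{count}")
```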

A grouped frequency table is similar to a frequency table, but now the data has been grouped into classes to avoid long tables. Table 3 shows an example of a grouped frequency table.

Table 3
Example of a Grouped Frequency Table

Income        Frequency
0 < 10        8
10 < 20       12
20 < 50       22
50 or more    5
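
The grouping itself can be sketched in a few lines of Python. The class bounds below are those of Table 3; the list of incomes and the helper name classify are made up for illustration, and the sketch assumes that the lower bound belongs to a class and the upper bound does not.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the classes in Table 3; the last class is open-ended.
lower_bounds = [0, 10, 20, 50]
labels = ["0 < 10", "10 < 20", "20 < 50", "50 or more"]

def classify(value):
    """Return the label of the class that a non-negative value falls into."""
    if value < lower_bounds[0]:
        raise ValueError(f"{value} is below the lowest class")
    # bisect_right counts how many lower bounds are <= value.
    return labels[bisect_right(lower_bounds, value) - 1]

# Hypothetical incomes, for illustration only.
incomes = [3.5, 9.99, 10, 15, 19.5, 20, 49.9, 50, 120]
grouped = Counter(classify(x) for x in incomes)
```

Note that a value of exactly 10 ends up in 10 < 20 and not in 0 < 10.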

A contingency table (or cross table) shows two variables, one in the rows and the other in the columns, as shown for example in Table 4.

Table 4
Example of a Cross Table

              Gender
Income        Female    Male
0 < 10        4         6
10 < 20       8         12
20 < 50       10        13
50 or more    7         5


Friday 30 May 2014

2.1.2. About constructing tables in general


Common sense rules when it comes to creating tables, but there are a few things each table should include:
  • A clear title for the table.
  • Clear column titles. It should be clear what each column shows. 'Income', for example, is not very clear: the table title, the column title or a note should mention that it is income in net euros per hour.
  • The source. Where did the table come from?
Some formatting styles (such as APA and ASA) have fixed rules on where to place the above mentioned points, as well as which borders are and are not allowed to be drawn.

When creating classes you have to make sure that:
  1. The classes do not overlap (i.e. any value should fit in only one of the classes, technically called mutually exclusive)
  2. Each value fits in a class (i.e. do not forget one or more classes, technically called exhaustive)
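
These two rules can be checked mechanically. The sketch below assumes that every class is written as lower < upper (lower bound included, upper bound excluded); the function name check_classes and the example classes are my own. It only looks for gaps between neighbouring classes, so whether the lowest and highest class reach far enough still has to be checked against the data.

```python
def check_classes(classes):
    """Check classes given as (lower, upper) pairs for overlaps and gaps.

    Returns a list of problems; an empty list means the classes are
    mutually exclusive and leave no gaps between them.
    """
    problems = []
    ordered = sorted(classes)
    for (low_a, up_a), (low_b, up_b) in zip(ordered, ordered[1:]):
        if low_b < up_a:
            problems.append(f"overlap between {low_a} < {up_a} and {low_b} < {up_b}")
        elif low_b > up_a:
            problems.append(f"gap between {up_a} and {low_b}")
    return problems

good = [(0, 10), (10, 20), (20, 50)]
bad = [(0, 10), (8, 20), (25, 50)]  # one overlap and one gap

print(check_classes(good))
print(check_classes(bad))
```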
Since tables in descriptive statistics are used to summarize data, and long tables are not very clear, it is often recommended not to create too many classes. There is no fixed rule for this, but between 5 and 20 classes is often used as a rule of thumb. Diagrams are often drawn from tables, and those diagrams are in turn often used to see how the data is distributed (a lot more on distributions will come later). Since the diagram often depends on how the classes are created, some far more technical methods are also available. These go beyond the scope of this blog, but a summary of a few of them can be found in Appendix B.

There is also some terminology that comes into play when talking about classes (or bins).

Symbols for class interval
There are a few different ways to represent the intervals. To indicate smaller than (but not equal to), the clearest symbol (in my opinion) is <, and to indicate smaller than or equal to, ≤. A more technical method is the use of [ or ] to indicate ‘including’ and ( or ) to indicate ‘excluding’. The interval 2 < 5 is then the same as [2,5), and the interval 2 ≤ 5 is the same as [2,5].

Another symbol often used is a hyphen (-). However, it is sometimes used in the same way as < (Chaudhary, Kumar, & Alka, 2009; Sharma, 2007), and sometimes in the same way as ≤ (Beri, 2010; Haighton, Haworth, & Wake, 2003).
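
The difference between the notations only matters for a value that sits exactly on the upper end of a class. A small Python sketch (the function name in_interval and its include_upper parameter are my own) shows why a class written as 2-5 is ambiguous:

```python
def in_interval(value, lower, upper, include_upper=False):
    """Return True if value lies in [lower, upper), or in [lower, upper]
    when include_upper is set."""
    if include_upper:
        return lower <= value <= upper
    return lower <= value < upper

# Reading "2-5" as 2 < 5, i.e. [2,5): the value 5 is not in the class.
reading_as_less_than = in_interval(5, 2, 5)

# Reading "2-5" as 2 ≤ 5, i.e. [2,5]: the value 5 is in the class.
reading_as_less_or_equal = in_interval(5, 2, 5, include_upper=True)
```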

Class limit and Class boundary
If we have classes such as 0-2, 3-5, 6-10 the lower end is called the lower limit and the upper end the upper limit. Note however that a value of 2.498 would go into the first class of 0-2, so the true limits of the classes are actually 0 < 2.5, 2.5 < 5.5, 5.5 < 10.5. These are called the boundaries (the lower one the lower bound and the upper one the upper bound), also known as the true or closed limits. The class boundary is defined as "the value halfway between the upper limit of one class and the lower limit of the next class" (Kenney, 1939, p. 14). In many cases though people are not aware of the difference between boundaries and limits and interchange the words.
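
Kenney's definition translates into a short calculation. The sketch below is only an illustration: the function name class_boundaries is my own, it needs at least two classes, and for the outermost boundaries it assumes the same half-gap as between the neighbouring classes. That assumption gives -0.5 as the lowest boundary, where the text above keeps 0.

```python
def class_boundaries(limits):
    """Turn class limits such as (0, 2), (3, 5) into class boundaries.

    Each inner boundary lies halfway between the upper limit of one class
    and the lower limit of the next. The two outermost boundaries are
    extended by half of the neighbouring gap (an assumption of this sketch).
    """
    boundaries = []
    last = len(limits) - 1
    for i, (lower, upper) in enumerate(limits):
        if i > 0:
            low_b = (limits[i - 1][1] + lower) / 2
        else:
            low_b = lower - (limits[1][0] - upper) / 2
        if i < last:
            up_b = (upper + limits[i + 1][0]) / 2
        else:
            up_b = upper + (lower - limits[i - 1][1]) / 2
        boundaries.append((low_b, up_b))
    return boundaries

limits = [(0, 2), (3, 5), (6, 10)]
boundaries = class_boundaries(limits)

# The distance between the two boundaries of each class.
widths = [up_b - low_b for low_b, up_b in boundaries]
```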

Class mark
The class mark is the average of the two limits (or boundaries) of that class. It is also known as midvalue or central value (Kenney, 1939).

This does, however, create a problem, which an example illustrates. Let’s assume we have created classes for how many books people read a year as 0 < 2, 2 < 4 and so on. I could also write this as 0 ≤ 1, 2 ≤ 3 and so on.
To convert lower class limits into lower class boundaries we would normally simply subtract 0.5, and for the upper ones add 0.5. For the class 2 < 4 the lower boundary would then become 1.5 and the upper boundary 4.5. The class mark is then (1.5 + 4.5) / 2 = 3. However, if I use the class limits of 2 ≤ 3, the class mark would be (2 + 3) / 2 = 2.5. The 2.5 seems more appropriate in my opinion because we are talking about a discrete variable (number of books), so the class boundaries are only theoretical.

My recommendation is therefore to use < when the variable that is being grouped is continuous and ≤ when the variable is discrete, and define the class mark as: “The average of the two boundaries in case of a continuous variable that has been grouped and the average of the two limits in case of a discrete variable that has been grouped.”
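
As a small illustration of this recommendation in Python (the function name class_mark is my own): for the discrete class 2 ≤ 3 the class mark based on the limits equals the mean of the values that can actually occur in that class.

```python
from statistics import mean

def class_mark(lower, upper):
    """Average of the two ends of a class (limits or boundaries)."""
    return (lower + upper) / 2

# Continuous variable, class 2 < 4: use the boundaries 2 and 4.
continuous_mark = class_mark(2, 4)

# Discrete variable (number of books), class 2 ≤ 3: use the limits 2 and 3.
discrete_mark = class_mark(2, 3)

# The only values that can occur in the discrete class are 2 and 3.
mean_of_possible_values = mean([2, 3])
```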

Class interval and Class width
An interval is a range of values, so the class interval is the range of values that goes into that class. The interval can be stated either with the class limits or boundaries. The class width is how wide the class is and is also known as the class length (Bowerman et al., 2009).

Although frequencies are commonly used in tables, there are several different types of frequencies.


References
Beri, G. C. (2010). Business Statistics (3rd ed.). New Delhi: Tata McGraw-Hill Education.
Bowerman, B. L., O’Connell, R. T., Murphree, E. S., Huchendorf, S. C., Porter, D. C., & Schur, P. J. (2009). Business statistics in practice. Boston, MA.: McGraw-Hill/Irwin.
Chaudhary, K. K. S., Kumar, A., & Alka. (2009). Statistics in Management Studies (10th ed.). Meerut: Krishna Prakashan Media.
Haighton, J., Haworth, A., & Wake, G. (2003). Statistics. Cheltenham: Nelson Thornes.
Kenney, J. F. (1939). Mathematics of Statistics; Part one. London: Chapman & Hall. Retrieved from http://archive.org/details/MathematicsOfStatisticsPartI
Sharma, J. K. (2007). Business Statistics (2nd ed.). Delhi: Pearson Education.

Thursday 29 May 2014

2.1.3. Frequencies

Once we have data we often like to know 'how many...', which can be answered by generating frequency tables. The basic frequency simply counts how many data points belong to a specific group. These are known as the absolute frequencies.

In some instances it is also useful to report some other type(s) of frequency. The main three types are the relative frequency, cumulative frequency and frequency density.

Relative Frequency
The relative frequency is the frequency of a class/group in relation to the total frequency. It is defined as: "[absolute frequency] expressed as a fraction of the total frequency" (Kenney & Keeping, 1954, p. 17).

Often the relative frequency gets multiplied by 100 to yield the percentage.
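
A minimal sketch of the calculation in Python, using the absolute frequencies of Table 3 (the variable names are my own):

```python
# Absolute frequencies from Table 3.
frequencies = {"0 < 10": 8, "10 < 20": 12, "20 < 50": 22, "50 or more": 5}

total = sum(frequencies.values())

# Relative frequency: the absolute frequency as a fraction of the total.
relative = {label: count / total for label, count in frequencies.items()}

# Multiplied by 100 to yield the percentage (rounded to one decimal).
percentages = {label: round(100 * rf, 1) for label, rf in relative.items()}
```

Because of the rounding, the reported percentages do not always add up to exactly 100.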

Cumulative Frequency & Inverse Cumulative Frequency
The cumulative frequency keeps adding up (accumulates) and shows the number of cases that are equal to or lower than the class/group. It is defined as: "the total (absolute) frequency up to the upper boundary of that class" (Kenney, 1939, p. 16) and the inverse cumulative frequency is the exact opposite: “the frequency of all values greater than the lower class boundary of a given class” (Kozak, Kozak, Staudhammer, & Watts, 2008, p. 13).
Since the cumulative frequency is interpreted with "equal or less than", and the inverse cumulative with "equal or more than", the variable should have a logical order, and hence cannot be used if the variable is of nominal level.
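
Both can be sketched with a running total in Python, again using the frequencies of Table 3. The inverse cumulative frequency is simply the running total taken from the last class backwards.

```python
from itertools import accumulate

labels = ["0 < 10", "10 < 20", "20 < 50", "50 or more"]
frequencies = [8, 12, 22, 5]

# Running total from the first class onwards.
cumulative = list(accumulate(frequencies))

# Running total from the last class backwards, put back in class order.
inverse_cumulative = list(accumulate(reversed(frequencies)))[::-1]

for label, cf, icf in zip(labels, cumulative, inverse_cumulative):
    print(f"{label:<12}{cf:>4}{icf:>4}")
```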

Frequency Density
As the name implies, the frequency density shows how dense a class is that has been binned (e.g. 0 < 10, 10 < 30, etc.), or in other words how crowded it is. It works like population density (which is population divided by area), which means it can easily be determined by dividing the (absolute) frequency by the class width.
Pearson (1895, p. 399) does not mention the term frequency density, but does mention that in histograms the area of the columns should equal the (absolute) frequency. Since the width of a column in a histogram is equal to the class width, and an area of a column is equal to the width times height, it is simple to deduce that the height should be the absolute frequency divided by the class width.
The frequency density is especially useful in cases when class widths are not equal for all classes.
Since the frequency density needs a class width it can only be determined for interval and ratio variables.
In some cases many classes have the same width except a few. Some authors (e.g. Barrow, 2009; Burton, Carrol, & Wall, 2002; Haighton, Haworth, & Wake, 2003) suggest then to set a ‘standard class width’ and determine the frequency density based on the standard class width.
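
A minimal sketch in Python, using the first three classes of Table 3 (the open-ended class '50 or more' has no class width and is left out). The function name frequency_density and its standard_width parameter are my own, and the way the standard class width is applied is my reading of that suggestion.

```python
# (lower boundary, upper boundary, absolute frequency) for the first three
# classes of Table 3; "50 or more" has no upper boundary, so no class width.
classes = [(0, 10, 8), (10, 20, 12), (20, 50, 22)]

def frequency_density(lower, upper, frequency, standard_width=1):
    """Absolute frequency per standard_width units of the variable."""
    return frequency / (upper - lower) * standard_width

# Frequency per single unit of income.
densities = [frequency_density(*c) for c in classes]

# Frequency per standard class width of 10.
per_ten = [frequency_density(*c, standard_width=10) for c in classes]
```

The class 20 < 50 has the highest absolute frequency (22), but the class 10 < 20 has the highest frequency density. With a standard class width of 10, the densities of the two classes that already have that width are equal to their absolute frequencies.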

Combinations
Combinations are also possible:
  • Relative Frequency Density
  • Cumulative Frequency Density
  • Cumulative Relative Frequency
  • Cumulative Relative Frequency Density
  • Inverse Cumulative Frequency Density
  • Inverse Cumulative Relative Frequency
  • Inverse Cumulative Relative Frequency Density.
The relative frequency density can be determined using the same reasoning as for relative frequencies, and simply divide the frequency density by the total frequency (Haighton, Haworth, & Wake, 2003, p. 74), or using the same reasoning as for frequency density itself and dividing the relative frequency by the class width (Kozak et al., 2008, p. 80). Both approaches will yield the same result.
Note that according to Petry & Friesen (2012) the cumulative relative frequency density is pointless to calculate. This is a bit strange because especially for histograms showing probabilities, these are frequently used.
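
That both routes to the relative frequency density give the same result can be checked with exact fractions. The sketch below uses the class 20 < 50 from Table 3 (frequency 22, class width 30, total frequency 47); the variable names are my own.

```python
from fractions import Fraction

frequency, width, total = 22, 30, 47

# Route 1: frequency density divided by the total frequency.
via_density = Fraction(frequency, width) / total

# Route 2: relative frequency divided by the class width.
via_relative = Fraction(frequency, total) / width

print(via_density, via_relative)
```

Both are the frequency divided by the product of class width and total frequency, so the order of the two divisions does not matter.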

Example with some interpretation

  • FD (Frequency Density)
    From the FD column we can see that although the 50 < 100 class has the most people in absolute frequencies, if we take the class width into consideration the 10 < 20 class is the most crowded (highest FD).
  • RF (Relative Frequency)
    If we wanted to compare our data with another set of data that used the same class widths but a different number of cases, we could compare the RFs to see where, relatively speaking, the most people fall.
  • RFD (Relative Frequency Density)
    If we wanted to compare our data with another set of data that used different class widths, we could compare the RFDs to see which classes were most crowded.
  • CF (Cumulative Frequency)
    We can see immediately that 22 cases would fall in 0 < 50.
  • ICF (Inverse Cumulative Frequency)
    We can see immediately that 21 cases would fall in 10 < 100.
It is possible to construct formulas for each type of frequency, but they often look scarier than the calculation actually is. However, if you like, you can find the formulas with an example calculation in Appendix C.


References
Barrow, M. (2009). Statistics for Economics, Accounting and Business Studies (5th ed.). Essex: Pearson Education.
Burton, G., Carrol, G., & Wall, S. (2002). Quantitative Methods for Business and Economics (2nd ed.). Essex: Pearson Education.
Haighton, J., Haworth, A., & Wake, G. (2003). Statistics. Cheltenham: Nelson Thornes.
Kenney, J. F. (1939). Mathematics of Statistics; Part one. London: Chapman & Hall. Retrieved from http://archive.org/details/MathematicsOfStatisticsPartI
Kenney, J. F., & Keeping, E. S. (1954). Mathematics of Statistics; Part one (3rd ed.). New York: D. Van Nostrand Company. Retrieved from http://hdl.handle.net/2027/mdp.39015015725339
Kozak, A., Kozak, R. A., Staudhammer, C. L., & Watts, S. B. (2008). Introductory Probability and Statistics: Applications for Forestry and Natural Sciences. Wallingford: CABI.
Pearson, K. (1895). Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London. (A.), 186, 343–414. doi:10.1098/rsta.1895.0010
Petry, R. G., & Friesen, B. (2012). STAT 100; Elementary Statistics for Applications. Campion College. Retrieved from http://amberlin.asuscomm.com/university_of_regina_copies/stat_100_lecture_notes_v2/intro_stats_v2.pdf

Wednesday 28 May 2014

2.1.4. Using technology to create tables

Excel

In Excel you can use the function COUNTIF or FREQUENCY to generate frequency tables. A third method might be to use a Pivot table.


SPSS

A basic frequency table in SPSS can easily be generated using either:
  • Analyze - Descriptive statistics - Frequencies or
  • Analyze - Tables - Custom tables.
To generate a cross table in SPSS you can use either:
  • Analyze - Descriptive statistics - Crosstabs or
  • Analyze - Tables - Custom tables.
To generate a table from a list of questions all using the same scale you can use:
  • Analyze - Tables - Custom tables
To create a table from a question where multiple answers were allowed, you will need to create a 'set' that lets SPSS know which questions were part of the multiple answer question. This can be done either by:
  • Analyze - Tables - Multiple Response Sets or
  • Analyze - Multiple Response - Define Variable Sets
To create a grouped frequency table in SPSS you will need to recode the variable into the categories. SPSS can do this either by using:
  • Transform - Recode into Different Variables or
  • Transform - Visual Binning


TI-83

On a TI-83 graphical calculator you can enter a frequency table. You will need to create the separate columns in the STAT - Edit menu.
For the cumulative frequency, you can enter the cumSum( option, found under 2nd - LIST - OPS, as the function for the column.
For the relative frequency you will need the sum of the frequency column. You can find the sum function under 2nd - LIST - MATH.


Tuesday 27 May 2014

2.1.5. Side notes on tables and frequencies

A few side notes related to tables and frequencies.

Tally marks
The absolute frequency is the count of the number of items belonging to a class or group. To do this counting, a tally mark system can be used. If we consider tally marks as part of the table, and tables as part of statistics, one might consider the Lebombo bone one of the oldest known statistics in history. The Lebombo bone (Figure 1) dates to approximately 35,000 BC (Bogoshi, Naidoo, & Webb, 1987).


Figure 1. The Lebombo bone. Reprinted from Primitive numbers and a history of counting, by S. Chavda, 2011, Retrieved from http://www.swatichavda.com/2011/10/primitive-numbers-history-of-counting.html.

As for tally systems, there are a few popular ones across the globe, illustrated in Table 5:

Table 5
Tally systems



Simpson's paradox
Simpson (1951) describes the following situation: A person is interested in the proportion of court cards (King, Queen and Knave), and in whether cards are clean or dirty. He checks two decks of cards, a red deck and a black deck, and finds the proportions shown in Table 6 and Table 7.
 
Table 6
Card Distribution Among Dirty Cards

Dirty cards    Court    Plain
Red deck       4/12     8/12
Black deck     3/8      5/8

Assuming we prefer court cards, we can conclude that among the dirty cards the black deck is preferred (it has a higher proportion of court cards than the red deck). Now let's look at the results for the clean cards:

Table 7
Card Distribution Among Clean Cards

Clean cards    Court    Plain
Red deck       2/14     12/14
Black deck     3/18     15/18

Also with the clean cards the black deck would be preferred. However, if we combine the dirty and clean cards we get the proportions shown in Table 8.

Table 8
Card Distribution Among All Cards

All cards      Court    Plain
Red deck       6/26     20/26
Black deck     6/26     20/26

And now all of a sudden the red and black decks are equal. This phenomenon is known as Simpson's paradox.
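
The three comparisons can be verified with exact fractions. The counts in this Python sketch come from Tables 6 and 7; the variable and function names are my own.

```python
from fractions import Fraction

# (court cards, total cards) per deck, taken from Tables 6 and 7.
dirty = {"Red": (4, 12), "Black": (3, 8)}
clean = {"Red": (2, 14), "Black": (3, 18)}

def court_share(court, total):
    """Proportion of court cards as an exact fraction."""
    return Fraction(court, total)

# Add the dirty and clean cards together for each deck.
combined = {
    deck: (dirty[deck][0] + clean[deck][0], dirty[deck][1] + clean[deck][1])
    for deck in dirty
}

black_wins_dirty = court_share(*dirty["Black"]) > court_share(*dirty["Red"])
black_wins_clean = court_share(*clean["Black"]) > court_share(*clean["Red"])
decks_equal_overall = court_share(*combined["Black"]) == court_share(*combined["Red"])
```

The black deck has the higher proportion of court cards in both subgroups, yet the combined proportions are identical (6/26 for both decks).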

Others had noticed the same phenomenon earlier than Simpson did (Cohen & Nagel, 1934; Pearson, Lee, & Bramley-Moore, 1899; Yule, 1903). Blyth (1972) was perhaps the first to call it ‘Simpson’s paradox’, but should probably have called it Yule’s paradox. For more examples you can have a look at the Wikipedia entry, and for a more technical explanation and details, Pearl's (2013) article.

Benford's Law
If you look at a long list of real-world numbers and only look at the first digit (the leftmost one), it turns out that in 30.1% of the values it will be a 1, even though we might have expected this in only 1/9 = 11.1% of them. This is known as Benford's Law. Benford (1938) published his article on it, but Newcomb (1881) had already discovered this decades earlier.
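
The 30.1% follows from the usual statement of Benford's Law, which is not spelled out above: the expected share of numbers with first digit d is log10(1 + 1/d). A short Python sketch (the function name is my own):

```python
import math

def benford_probability(digit):
    """Expected share of numbers whose first digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(d, f"{100 * p:.1f}%")
```

The shares drop steadily from the digit 1 to the digit 9 and together add up to 1.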


References
Benford, F. (1938). The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78(4), 551–572.
Blyth, C. R. (1972). On Simpson’s Paradox and the Sure-Thing Principle. Journal of the American Statistical Association, 67(338), 364–366. doi:10.1080/01621459.1972.10482387
Bogoshi, J., Naidoo, K., & Webb, J. (1987). The oldest mathematical artifact. The Mathematical Gazette, 71(458), 294.
Cohen, M. R., & Nagel, E. (1934). An introduction to logic and scientific method. New York: Harcourt, Brace and company.
Newcomb, S. (1881). Note on the frequency of use of the different digits in natural numbers. American Journal of Mathematics, 4(1), 39–40.
Pearl, J. (2013). Understanding Simpson’s Paradox (SSRN Scholarly Paper No. ID 2343788). Rochester, NY: Social Science Research Network. Retrieved from http://papers.ssrn.com/abstract=2343788
Pearson, K., Lee, A., & Bramley-Moore, L. (1899). Mathematical Contributions to the Theory of Evolution. VI. Genetic (Reproductive) Selection: Inheritance of Fertility in Man, and of Fecundity in Thoroughbred Racehorses. Royal Society of London. Retrieved from http://archive.org/details/philtrans07768035
Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, 13(2), 238–241.
Yule, G. U. (1903). Notes on the Theory of Association of Attributes in Statistics. Biometrika, 2(2), 121–134. doi:10.1093/biomet/2.2.121

Saturday 24 May 2014

2.2. Diagrams (charts, graphs and plots)

The terms diagram, chart, graph and plot often overlap in their definition and on this site I will not make any distinction between them. Diagrams are used to visualize data. There is a very long list of types of diagrams that can be used; the most common ones will be discussed on this site.

The purpose of a diagram should be to describe the data in a clear and efficient way. The first question to ask is, is a diagram necessary? Figure 2 shows an example of a diagram from Wittke-Thompson, Pluzhnikov & Cox (2005).

Figure 2. Example of an unnecessary diagram. Reprinted from “Rational Inferences about Departures from Hardy-Weinberg Equilibrium,” by J.K. Wittke-Thompson, A. Pluzhnikov and N.J. Cox, The American Journal of Human Genetics, 76(6), p. 971.

The problem with this chart is that it should not exist. It shows that all values were zero. The author could have saved the space (it takes up about half the page) by simply writing that in text.

Once you have determined that a diagram adds value, the next decision is which type of diagram to use.
  • If you have one discrete variable, you could use either a circle diagram or a bar chart.
  • If you have one continuous variable, you could use a histogram or a frequency polygon.
  • If you have two discrete variables, you could use either a compound or a clustered bar-chart.
  • If you have two continuous variables, you could use a scatterplot or a line chart.

References
Wittke-Thompson, J. K., Pluzhnikov, A., & Cox, N. J. (2005). Rational Inferences about Departures from Hardy-Weinberg Equilibrium. The American Journal of Human Genetics, 76(6), 967–986. doi:10.1086/430507

Friday 23 May 2014

2.2.1. Circle diagram (a.k.a. Pie chart)


As the name implies a circle diagram is a circle divided into slices, each representing a category as in the example shown in Figure 3. 


Figure 3. Example of a pie chart

A circle has 360 degrees, equal to 100%. So by multiplying the relative frequencies by 360, the degrees for each category can be found. This means that visually the circle diagram can only show the relative frequencies.
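
As a small worked example in Python, using the frequencies of Table 2 (the variable names are my own):

```python
# Absolute frequencies from Table 2.
frequencies = {"Female": 28, "Male": 21}

total = sum(frequencies.values())

# Relative frequency times 360 gives the angle of each slice in degrees.
degrees = {label: 360 * count / total for label, count in frequencies.items()}

for label, angle in degrees.items():
    print(f"{label}: {angle:.1f} degrees")
```

The slices together always fill the full 360 degrees, whatever the total frequency is.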

The circle diagram is quite popular and often used, but actually has a few disadvantages:
  • It can only show relative frequencies. To show other frequencies the numbers themselves have to be added.
  • When the relative frequencies are close to each other, the differences are not easily seen in a circle diagram.
  • When there are many categories the circle diagram will look very busy and will not be easy to read.
  • The circle diagram can only show one variable.
  • It is not suitable for the results of a multiple-answer question.
An example of a 'bad' circle diagram, created by a program called ‘The Archivist’ and used in an online article (Perez, 2009), is shown in Figure 4.


Figure 4. Example of a pie chart with too many slices. Reprinted from 10 Ways to Archive Your Tweets, by S. Perez, 2009, Retrieved from http://readwrite.com/2009/08/10/10_ways_to_archive_your_tweets.

There are clearly too many categories, and it is difficult to tell which of the few biggest slices is bigger. A possible solution in this case could have been to group the smallest categories into one category 'Other', and even then to use a bar-chart instead.

Another example of a 'bad' circle diagram comes from The Economist (as cited in Kyd, 2012) shown in Figure 5.


Figure 5. Example of pie chart with rel. freq. too close to each other. Reprinted from Good Examples of Bad Charts: Chart Junk from a Surprising Source, by C. Kyd, 2012, Retrieved from http://exceluser.com/blog/1133/good-examples-of-bad-charts-chart-junk-from-a-surprising-source.html


It is very difficult to see the differences. It is not until you look at the numbers that you can tell which slices are bigger or smaller (with a few exceptions). A bar-chart would have been preferred.

A third example comes from Fox News, shown in Figure 6.


Figure 6. Example of a pie chart with percentages not adding up to 100.

It seems pretty odd that the percentages do not add up to 100%. This is probably because people could choose multiple options. The size of the slices is then somewhat irrelevant. A bar-chart would have been preferred here.

There are many variations on the circle diagram. Some only add a visual effect (moving one or more slices out of the center, a.k.a. exploded, or creating a 3D effect), and some are more complex (such as the polar charts made famous by Florence Nightingale, doughnut charts, spie charts, etc.).

History
The earliest known circle diagram is found on the inlay of a book by William Playfair (1801), shown in Figure 7:


Figure 7. Earliest found pie chart. Reprinted from The statistical breviary: shewing the resources of every state and kingdom (p. inlay), by W. Playfair, 1801, London: T. Bensley

The name 'pie chart' might come from a misspelling of the word Pi, which is often associated with a circle. It might also simply come from the resemblance to a pie (as in apple pie). However, Srivastava & Rego (2011) put forward another belief: that it is named after a royal French cook, Pie, who served dishes in a pie-chart shape.


References
Kyd, C. (2012, April 5). Good Examples of Bad Charts: Chart Junk from a Surprising Source. ExcelUser. Retrieved from http://exceluser.com/blog/1133/good-examples-of-bad-charts-chart-junk-from-a-surprising-source.html
Perez, S. (2009, August 10). 10 Ways to Archive Your Tweets. ReadWrite. Retrieved April 27, 2014, from http://readwrite.com/2009/08/10/10_ways_to_archive_your_tweets
Playfair, W. (1801). The statistical breviary: shewing the resources of every state and kingdom. London: T. Bensley. Retrieved from http://archive.org/details/statisticalbrev00playgoog
Srivastava, T. N., & Rego, S. (2011). Business Research Methodology. New Delhi: Tata McGraw-Hill.

Thursday 22 May 2014

2.2.2. Bar chart

As the name implies a bar-chart is data represented by bars. It is a common type of chart for discrete data. An example of a bar-chart is shown in Figure 8.

Figure 8. Example of a bar-chart

Note that the width of each bar is equal, there are gaps between the bars (to emphasize the discrete character), and the vertical (y) axis represents the frequencies. These points will be different in a histogram. It is also helpful if the bars are sorted from large to small.

An advantage of a bar-chart is that it can also compare two variables. This can either be done by adding the two on top of each other (known then as a compound- or stacked bar-chart) as shown in Figure 9, or next to each other (known as a clustered- or multiple bar-chart) as shown in Figure 10.

Figure 9. Example of a compound bar chart



Figure 10. Example of a clustered bar chart

A few notes on drawing a bar-chart
As a guideline for the size of the bar there is a rule of thumb known as the 'three quarter high rule' (Pitts, 1971). It means that the height of the y-axis should be 3/4 of the length of the horizontal x-axis. So if the horizontal axis is 20 cm long, the vertical axis should be 3/4 * 20 = 15 cm high.

According to Singh (2009) vertical bars (instead of horizontal bars as for example in Figure 11) are preferred since they are easier on the eye. However if you have long category names some names might become unreadable. A bar chart with the bars placed horizontally might then be preferred.


History
Although the diagrams used by Nicole Oresme (1486) do look like a bar-chart, they were mainly used to illustrate a theoretical concept and not so much as a descriptive statistic. The earliest known bar-chart (Figure 11) used as a descriptive statistic comes again from William Playfair (1786).


Figure 11. Earliest known bar chart. Reprinted from The commercial and political atlas (p. XX), by W. Playfair, 1786, London: Debrett; Robinson; and Sewell


References
Oresme, N. (1486). Tractatus de latitudinibus formarum. (B. Pelacani da Parma, Ed.). Padua: Mathaeus Cerdonis. Retrieved from http://catalog.hathitrust.org/Record/010883454
Pitts, C. E. (1971). Introduction to Educational Psychology: An Operant Conditioning Approach. New York: Crowell.
Playfair, W. (1786). The commercial and political atlas. London: Debrett; Robinson; and Sewell.
Singh, G. (2009). Map Work And Practical Geography (4th ed.). New Delhi: Vikas Publishing House Pvt Ltd.

Wednesday 21 May 2014

2.2.3. Histogram


The term histogram was introduced by Pearson (1895) and defined as "...a term for a common form of graphical representation, i.e., by columns marking as areas the frequency corresponding to the range of their base" (p. 399).

Figure 12 shows an example of a histogram.


Figure 12. Example of a histogram.

The area of a rectangle is width x height, which should equal the (absolute) frequency. Since the width is determined by the class width, we obtain the following equation: Class Width x Height = (Absolute) Frequency. From this we can deduce that Height = Absolute Frequency / Class Width, which is the same formula as for the Frequency Density. Therefore the height of the bars in a histogram is determined by the frequency density and NOT the absolute frequency itself (which is represented by the area of the bar).

Note that many books and software programs do use the absolute frequency as the height in a histogram. When all classes have the same width this is not a big problem, but when they vary it is misleading.
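
How misleading this can be is easy to put in numbers. The Python sketch below uses three classes with unequal widths (0 < 10, 10 < 20 and 20 < 50, with frequencies 8, 12 and 22) and compares the share of the total bar area that each class gets under both choices of height. The function and variable names are my own.

```python
# (lower boundary, upper boundary, absolute frequency) per class.
classes = [(0, 10, 8), (10, 20, 12), (20, 50, 22)]
total = sum(freq for _, _, freq in classes)

def area_shares(heights):
    """Share of the total bar area taken up by each class."""
    areas = [(upper - lower) * h for (lower, upper, _), h in zip(classes, heights)]
    return [area / sum(areas) for area in areas]

# Height = frequency density: the areas are proportional to the frequencies.
density_heights = [freq / (upper - lower) for lower, upper, freq in classes]
correct = area_shares(density_heights)

# Height = absolute frequency: the wide class gets far too much area.
frequency_heights = [freq for _, _, freq in classes]
misleading = area_shares(frequency_heights)
```

The class 20 < 50 holds 22 of the 42 cases (about 52%). With the frequency density as height it also covers about 52% of the area, but with the absolute frequency as height it covers about 77%.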

Bar-charts and histograms are often incorrectly considered to be the same. An overview of the main differences is summarized in Table 9.

Table 9
Differences between Bar-charts and Histograms

                              Bar-chart                                                          Histogram
Type of data                  Discrete                                                           Continuous
Width of the bars / bins      Free to choose, but all bars the same width                        Depends on the class width
Height of the bars / bins     Any type of frequency                                              Any type of frequency density
Positioning of bars / bins    Small gaps between the bars to highlight the discrete data type    No gaps

 
 
References
Pearson, K. (1895). Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London. (A.), 186, 343–414. doi:10.1098/rsta.1895.0010

Tuesday 20 May 2014

2.2.4. Charts with lines

Line chart (or Line graph)

Definition: "Graphical device that displays quantitative information or illustrates relationships between two changing quantities (variables) with a line or curve that connects a series of successive data points." (BusinessDictionary.com, n.d.).

An example is shown in Figure 13.


Figure 13. Example of a line chart.
One of the two variables is often time (in months, years, etc.). The line charts are often used to show a trend.

Frequency polygon

Definition: "A distribution of discrete variates may be represented graphically by plotting the points, and drawing a broken line through them." (Kenney, 1939).

An example is shown in Figure 14.


Figure 14. Example of a polygon.
The definition by Kenney mentions the frequency polygon is used for discrete variates. However the frequency polygon is often used for grouped tables.

Ogive

Definition: "The graphs of cumulative frequencies" (Kenney, 1939).

An example is shown in Figure 15.


Figure 15. Example of an ogive.

Pareto diagram/plot

According to some authors a Pareto diagram is any diagram with the bars in order of size (Joiner, 1995; WhatIs.com, n.d.), while others suggest that a line representing the cumulative relative frequencies should also be included (Weisstein, n.d.).
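
The numbers behind both variants can be prepared in a few lines of Python. The category counts below are hypothetical and only serve as an illustration.

```python
from itertools import accumulate

# Hypothetical category counts, for illustration only.
counts = {"Late delivery": 12, "Damaged": 30, "Wrong item": 5, "Other": 3}

# The bars: categories sorted from the largest to the smallest frequency.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# The line: cumulative relative frequency in that same order.
total = sum(counts.values())
running_totals = accumulate(count for _, count in ordered)
cumulative_relative = [running / total for running in running_totals]

for (label, count), share in zip(ordered, cumulative_relative):
    print(f"{label:<15}{count:>4}{100 * share:>7.0f}%")
```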

Below are two examples. The first example (Figure 16) uses one vertical axis, the second (Figure 17) uses two.


Figure 16. Example of a Pareto diagram with one vertical axis.

Figure 17. Example of a Pareto diagram with two vertical axes.

Pareto (1919) described that in Italy 80% of the land was owned by 20% of the population. He noticed similar divisions in other countries. It is unclear who was the first to name the diagram after him.


References
BusinessDictionary.com. (n.d.). What is line graph? definition and meaning. Retrieved April 20, 2014, from http://www.businessdictionary.com/definition/line-graph.html
Joiner. (1995). Pareto charts: plain & simple. Madison, Wisc.: Joiner Associates.
Kenney, J. F. (1939). Mathematics of Statistics; Part one. London: Chapman & Hall. Retrieved from http://archive.org/details/MathematicsOfStatisticsPartI
Pareto, V. (1919). Manuale di economia politica: con una introduzione alla scienza sociale. Milano: Societa Editrice Libraria. Retrieved from http://archive.org/details/manualedieconomi00pareuoft
Weisstein, E. W. (n.d.). Mode from MathWorld. Text. Retrieved April 5, 2014, from http://mathworld.wolfram.com/Mode.html
WhatIs.com. (n.d.). What is Pareto chart (Pareto distribution diagram)? Retrieved April 20, 2014, from http://whatis.techtarget.com/definition/Pareto-chart-Pareto-distribution-diagram

Monday 19 May 2014

2.2.5. Using technology to create diagrams

Excel
Diagrams can easily be added in Excel. You can either start from scratch, or first select the data you want to see and then go to the Insert ribbon and select the chart you would like to use.

When a chart has been selected you also get the Chart ribbon, with plenty of options to adjust the chart.
Excel uses the term column chart for bar charts, and creating a histogram requires some extra work.

A video showing a few examples is shown below.


SPSS
In SPSS there are a few different ways to create graphs:
  • Graphs – Chart Builder
    This option lets you choose from various graphs and select the variables you want to use. It shows a visual draft of the chart you're building.
  • Graphs – Graphboard Template Chooser
    This option lets you choose the variable(s) you want a graph of and then suggests a few different charts. The disadvantage of this option is that the chart created cannot be changed much once it is created.
  • Graphs – Legacy Dialogs
    With this option you simply click on the chart you want and then fill out the options for that chart.
It is also possible to create a graph from a table in the output. Double click on the table, select the data you want a graph of, right-click and select the graph you want.

Many procedures in the Analyze menu also offer an option to add a chart to the output immediately.

Editing a chart in SPSS is done by double-clicking on the chart. This opens the Chart Editor (unless you used the Graphboard Template Chooser, in which case the Graphboard Editor opens, which has far less functionality).

A video showing the different methods mentioned above is shown below.


TI-83

Unfortunately the graphing capabilities of this graphical calculator are a bit limited and focused on graphing mathematical functions, not so much on graphing data.

By setting up two lists (see tables section) it is possible to get a line graph and a bar chart (but with the columns connected to each other). This can be done by using [2nd] [STAT PLOT].

A video showing an example of this can be found below.

Friday 16 May 2014

2.3.1. Mode

What is it?

The mode is the most common value obtained in a set of observations (Weisstein, n.d.).
If two or more items have the highest occurrence, they are all the mode. If each value occurs equally often, there is no mode.

Visually the mode (or modes) can be seen by the highest peak(s), as illustrated in the animated gif below (Figure 18).

Figure 18. Visual representation of the mode.


If there is only one mode the term unimodal is sometimes used; with two modes the term is bimodal, with three trimodal, etc. Alternatively, the term multimodal is sometimes used when there are two or more modes.

Strictly speaking the items with the highest frequencies are the mode (unless they are all equal), but in the example below (Figure 19) it is clear that there are two peaks, so one could argue that there are two modes.

Figure 19. Example of a data set with two local maximums.

In the above example the mode is 2, but 8 is also somewhat of a peak. This is known as a local maximum, and local maxima are often considered to be modes as well. The Oxford Dictionary of Statistics even considers any class whose neighboring classes have lower frequencies to be a modal class (Upton & Cook, 2014, pp. 272–273). There are even some statistical tests to check whether the data can be considered bimodal. One of these is discussed as a side note.
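The idea of a local maximum (a value whose neighbors both have a lower frequency) can be checked mechanically. A minimal sketch in Python; the frequencies are invented for illustration and are not the data behind Figure 19, and treating a missing neighbor at either end as frequency 0 is an assumption of this sketch:

```python
# Hypothetical frequencies for the values 1 to 10, invented for illustration.
frequencies = {1: 3, 2: 9, 3: 6, 4: 4, 5: 2, 6: 3, 7: 5, 8: 7, 9: 4, 10: 1}

values = sorted(frequencies)
local_peaks = []
for i, value in enumerate(values):
    # A missing neighbour at either end is treated as frequency 0 (assumption).
    left = frequencies[values[i - 1]] if i > 0 else 0
    right = frequencies[values[i + 1]] if i < len(values) - 1 else 0
    if frequencies[value] > left and frequencies[value] > right:
        local_peaks.append(value)

# The strict mode is the value with the single highest frequency.
strict_mode = max(frequencies, key=frequencies.get)

print(local_peaks)   # [2, 8]
print(strict_mode)   # 2
```

Here the strict mode is 2, while 8 shows up only as a local peak, mirroring the situation described above.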

Examples

In a survey ten people were asked to rate the design of a product on a scale of Very nice, Nice, Ugly, Very ugly. The responses the researcher got were: Very nice, Nice, Nice, Nice, Nice, Ugly, Ugly, Ugly, Very ugly, Very ugly. Since Nice has a frequency of four and all other responses occur less often, the mode is Nice. In this case the data is unimodal.

The following grades were obtained by various students: 2, 4, 5, 5, 5, 8, 9, 9, 9. Since five and nine are the only grades that appear three times, the others appear only once, and no grade appears four or more times, the modes are 5 and 9. In this case the data is bimodal. Note that you do NOT average the two.

In a group of 6 students the genders are: Male, Male, Female, Male, Female, Female. Since both male and female occur three times there is no mode.
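The three examples above can be reproduced with a short Python sketch. The function name `modes` is chosen for illustration, and it follows the rule used in this section: all values with the highest frequency are modes, and there is no mode when every value occurs equally often. Treating a data set with only one distinct value as having that value as its mode is an assumption of the sketch.

```python
from collections import Counter

def modes(values):
    """Return a list of all modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # If several distinct values all share the same frequency, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]

ratings = ["Very nice", "Nice", "Nice", "Nice", "Nice",
           "Ugly", "Ugly", "Ugly", "Very ugly", "Very ugly"]
grades = [2, 4, 5, 5, 5, 8, 9, 9, 9]
genders = ["Male", "Male", "Female", "Male", "Female", "Female"]

print(modes(ratings))  # ['Nice']
print(modes(grades))   # [5, 9]
print(modes(genders))  # []
```

Note that the function works on text as well as numbers, since finding the mode only requires counting.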

A special case is when data is grouped in classes. The class (or classes) with the highest frequency density is (or are) then considered the modal class (Kenney & Keeping, 1954).
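Frequency density is the frequency divided by the class width, which matters when classes differ in width. A minimal sketch in Python, assuming the three bounded classes of Table 3; the open-ended class '50 or more' is left out here because it has no defined width, which is a simplification of this sketch:

```python
# (lower bound, upper bound, frequency) for the bounded classes of Table 3.
classes = [(0, 10, 8), (10, 20, 12), (20, 50, 22)]

# Frequency density = frequency divided by class width.
densities = {(low, high): freq / (high - low) for low, high, freq in classes}

highest = max(densities.values())
modal_classes = [cls for cls, density in densities.items() if density == highest]

print(modal_classes)  # [(10, 20)]
```

The class 20 < 50 has the highest frequency (22), but it is three times as wide as the others; per unit of income the class 10 < 20 is the most crowded, so that is the modal class.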

Why (not) use the mode?

The main advantage of the mode is that it is the only measure of central tendency that can be used for variables on a nominal measurement level (e.g. Gender).

The disadvantage of the mode is that it ignores all other values. If we have the values 2, 2, 70, 80, 90, 100 the mode would be 2, but all other values are a lot higher. The 2 then does not represent the center very well, making the mode a poor measure of central tendency in this case.

Note that the often-heard term modal income might differ slightly from the strict statistical meaning. When reporting a modal income, check how the source defines it. For example, the Dutch ‘Centraal Planbureau’ calculates the modal income differently than by simply looking at the modal class.

Debate

There are two points of debate about determining the mode. The first concerns the case where all values have the same frequency (a so-called uniform distribution). Most textbooks say that in that case there is no mode, but a few state that all items are then the mode.
The second concerns the case of multiple modes. A few textbooks argue that, since no single item has the highest frequency, there is no mode (e.g. Johnson & Kuby, 2011, p. 66).

History

The earliest reference to the mode found is in the work of Pearson: "I have found it convenient to use the term mode for the abscissa corresponding to the ordinate of maximum frequency. Thus the "mean," the "mode," and the "median" have all distinct characters." (1895, p. 345).

The word mode is derived from the French word mode, which means ‘fashion’ (and which itself probably comes from the Latin modus). So it started with asking ‘what is fashionable’, that is, what most people are wearing.

Strong (1901) uses the term bimodal: “…is seen to be distinctly bimodal” (p. 286).

>>Next section: Median

References
Johnson, R., & Kuby, P. (2011). Elementary Statistics. Cengage Learning.
Kenney, J. F., & Keeping, E. S. (1954). Mathematics of Statistics; Part one (3rd edition.). New York: D. Van Nostrand Company. Retrieved from http://hdl.handle.net/2027/mdp.39015015725339

Pearson, K. (1895). Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London. (A.), 186, 343–414. doi:10.1098/rsta.1895.0010
Strong, R. M. (1901). A Quantitative Study of Variation in the Smaller North-American Shrikes. The American Naturalist, 35(412), 271–298.
Upton, G. J. G., & Cook, I. (2014). A dictionary of statistics (3rd ed.). Oxford: Oxford University Press.

Weisstein, E. W. (n.d.). Mode from MathWorld. Text. Retrieved April 5, 2014, from http://mathworld.wolfram.com/Mode.html