Chi Square Test

The Chi Square Test is a common method of analysing data and comparing it with a pre-determined hypothesis. It was introduced by Karl Pearson in 1900. Its formula is as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O stands for the observed frequency

E stands for the expected frequency

χ² stands for the test statistic
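As a minimal sketch of how the statistic is computed, the contribution (O - E)²/E is added up over every category. The function name `chi_square_statistic` and the die counts below are hypothetical and chosen only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: a six-sided die rolled 60 times,
# so a fair die is expected to show each face 10 times.
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6

print(round(chi_square_statistic(observed, expected), 3))  # 1.0
```

A statistic of zero would mean that the observed counts match the expected counts exactly; the larger the value, the further the data are from what was expected.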

One important use of this test is to decide whether to reject what is known as a null hypothesis. Here the null hypothesis states that the observed frequencies come from the expected distribution, so any difference between the observed and expected counts is due to chance alone. A large test statistic suggests that something other than chance, such as an extraneous factor, is influencing the outcome. The test requires at least two categories (possible outcomes).

Two terms are important with regard to this test: degrees of freedom and critical values. For a test of this kind, the degrees of freedom is the number of categories minus one. With two categories the degrees of freedom is one; with six categories it is five. Many tests use a significance level of 0.05, meaning that we accept a 5% chance of rejecting a null hypothesis that is actually true. The critical value for one degree of freedom at a significance level of 0.05 is 3.841. If the test statistic is lower than 3.841, we fail to reject the null hypothesis (often loosely described as accepting it). If it is higher, we reject the null hypothesis.
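The figure 3.841 can be reproduced with a short sketch, assuming only one degree of freedom: in that case the chi-square distribution is the distribution of a squared standard normal variable, so the cutoff is the square of the two-sided normal quantile. The helper names `critical_value_df1` and `decide` are hypothetical, and the approach does not extend to more degrees of freedom, where a chi-square table or a statistics library would be needed.

```python
from statistics import NormalDist


def critical_value_df1(alpha=0.05):
    """Upper-tail chi-square critical value for ONE degree of freedom.

    Assumption: df = 1 only, where chi-square is a squared standard normal,
    so the cutoff is the squared two-sided normal quantile.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


def decide(statistic, alpha=0.05):
    """Compare a df = 1 test statistic with the critical value."""
    if statistic > critical_value_df1(alpha):
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(round(critical_value_df1(), 3))  # 3.841
print(decide(1.0))  # fail to reject the null hypothesis
print(decide(5.0))  # reject the null hypothesis
```

The statistics 1.0 and 5.0 are made-up inputs that show one value on each side of the cutoff.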

The best way to understand the Chi Square Test is to look at an example. Let us flip a coin fifty times. There are two categories, or possible outcomes, in this example: heads and tails. That means that the degrees of freedom is 1. Our null hypothesis is that the coin is fair, so we expect heads 25 times and tails 25 times. On flipping the coin we get heads 27 times and tails 23 times. We insert these numbers in the formula and get 0.32 as the test statistic. Since this is lower than the critical value of 3.841, we do not reject our null hypothesis. What was observed in the experiment is consistent with what we expected.
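Written out term by term, with one term for heads and one for tails, the arithmetic for the coin example is:

$$
\chi^2 = \frac{(27 - 25)^2}{25} + \frac{(23 - 25)^2}{25} = \frac{4}{25} + \frac{4}{25} = 0.32
$$

Since 0.32 is well below 3.841, the two extra heads are comfortably within what chance alone would produce.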

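This test is usually performed on large samples, since the chi-square approximation becomes unreliable when the expected frequencies are small; a common rule of thumb is an expected frequency of at least 5 in each category. It is also assumed that the observations are independent and that every member of the population has an equal probability of being selected for the sample.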