14  Chi-squared Test

14.1 Goodness of Fit Test

In this section we use a chi-squared test to determine whether a sampled population follows a specified probability distribution.

14.1.1 A Multinomial Population

Multinomial Experiment

A multinomial experiment is one that possesses the following properties.

  1. The experiment consists of a fixed number \(n\) of trials.

  2. The outcome of each trial can be classified into one of \(k\) categories, called cells.

  3. The probability \(p_i\) that the outcome will fall into cell \(i\) remains constant for each trial. Moreover, \(p_1 + p_2 + \cdots + p_k = 1\).

  4. Each trial of the experiment is independent of the other trials.

Testing Market Shares

Company A has recently conducted aggressive advertising campaigns to maintain and possibly increase its share of the market (currently \(45\%\)) for fabric softener. Its main competitor, company B, has \(40\%\) of the market, and a number of other competitors account for the remaining \(15\%\).

To determine whether the market shares changed after the advertising campaign, the marketing manager for company A solicited the preferences of a random sample of 200 customers of fabric softener.

Of the 200 customers, 102 indicated a preference for company A’s product, 82 preferred company B’s fabric softener, and the remaining 16 preferred the products of one of the competitors. Can the analyst infer at the \(5\%\) significance level that customer preferences have changed from their levels before the advertising campaigns were launched?

We recognize this experiment as a multinomial experiment, and we identify the technique as the chi-squared goodness-of-fit test. Because we want to know whether the market shares have changed, we specify those precampaign market shares in the null hypothesis.

\[ H_0: p_1=0.45;\ \ p_2=0.40; \ \ p_3=0.15 \]

The alternative hypothesis attempts to answer our question, Have the proportions changed? Thus,

\[ H_1: \text{At least one } p_i \text{ is not equal to its specified value} \]

Chi-Squared Goodness-of-Fit Test Statistic

\[ \chi^2 =\sum_{i=1}^k \frac{(f_i-e_i)^2}{e_i} \]

where \(f_i\) is the observed frequency and \(e_i\) is the expected frequency of cell \(i\).

Note that \(e_i = n p_i\).

The sampling distribution of the test statistic is approximately chi-squared with \(k-1\) degrees of freedom, provided that the sample size is large. A common rule of thumb is that every expected frequency \(e_i\) should be at least 5.
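The statistic and its degrees of freedom can be computed directly from the definitions above. The following is a minimal sketch in plain Python, applied to the market-share counts; the function name `chi_squared_gof` is a hypothetical helper chosen for this illustration, not part of any library.

```python
def chi_squared_gof(observed, probs):
    """Return (statistic, degrees of freedom) for a goodness-of-fit test.

    observed: observed frequencies f_i, one per cell
    probs:    cell probabilities p_i specified under the null hypothesis
    """
    n = sum(observed)
    expected = [n * p for p in probs]            # e_i = n * p_i
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return stat, len(observed) - 1               # df = k - 1


# Observed preferences for A, B and the other competitors (n = 200),
# tested against the pre-campaign shares 45%, 40% and 15%.
stat, df = chi_squared_gof([102, 82, 16], [0.45, 0.40, 0.15])
print(round(stat, 2), df)
```

The sketch assumes that the null probabilities sum to 1 and that no expected frequency is zero; it does not check either condition.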

Test Statistic calculation

| Company | Observed frequency, \(f_i\) | Expected frequency, \(e_i\) | \(f_i-e_i\) | \(\frac{(f_i-e_i)^2}{e_i}\) |
|---|---|---|---|---|
| A | 102 | 90 | 12 | 1.60 |
| B | 82 | 80 | 2 | 0.05 |
| Other | 16 | 30 | -14 | 6.53 |
| Total | 200 | 200 | | \(\chi^2=8.18\) |

Critical value

At \(\alpha = 0.05\) with \(df = 3-1 = 2\), the critical value is \(\chi^2_\alpha = 5.99\).
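For the special case of 2 degrees of freedom, the chi-squared distribution coincides with an exponential distribution with mean 2, so its upper tail has the closed form \(P(\chi^2 > x) = e^{-x/2}\). The sketch below uses that fact to recover the critical value and an approximate p-value with the standard library only. The shortcut is valid for \(df = 2\) and should not be reused for other degrees of freedom.

```python
import math

alpha = 0.05

# Solve exp(-x / 2) = alpha for x (valid only for 2 degrees of freedom).
critical_value = -2.0 * math.log(alpha)

chi2_stat = 8.18                        # test statistic from the calculation above
p_value = math.exp(-chi2_stat / 2.0)    # upper-tail probability for df = 2

print(round(critical_value, 2))         # agrees with the tabulated 5.99
print(p_value < alpha)                  # True means H0 is rejected at level alpha
```

Comparing the p-value with \(\alpha\) leads to the same decision as comparing the statistic with the critical value.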

Decision

Since \(\chi^2 = 8.18 > \chi^2_{\alpha} = 5.99\), we reject the null hypothesis.

Interpretation/Conclusion

There is sufficient evidence at the \(5\%\) significance level to infer that the proportions have changed since the advertising campaigns were implemented.

14.1.2 Normal population (continuous)

The following example illustrates how to test whether a variable follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).

Example. A random sample of 500 car batteries was taken and the life of each battery was measured. Letting \(X\) denote battery life in years, suppose that the sample revealed the following distribution of battery life:

| Life (in years) | Frequency |
|---|---|
| \(X<1\) | 12 |
| \(1<X \le 2\) | 94 |
| \(2<X \le 3\) | 170 |
| \(3<X \le 4\) | 188 |
| \(4<X \le 5\) | 28 |
| \(5<X\) | 8 |
| Total | 500 |

Based on this data, test whether battery life follows a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\). Clearly state your hypotheses and use a significance level of \(\alpha = 5\%\).

Solution:

Hypotheses

H0: The battery life follows a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\).

Ha: The battery life does not follow a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\).

Test Statistic calculation

| Life (in years) | Probability, \(p_i\) | \(e_i=np_i\) | \(f_i\) | \(\frac{(f_i-e_i)^2}{e_i}\) |
|---|---|---|---|---|
| \(X<1\) | \(P(X<1)=P(Z<-1.64)=0.0505\) | 25.25 | 12 | 6.9530 |
| \(1<X \le 2\) | \(P(1<X \le 2)=P(-1.64<Z\le -0.73)=0.1826\) | 91.30 | 94 | 0.0798 |
| \(2<X \le 3\) | 0.3386 | 169.30 | 170 | 0.0029 |
| \(3<X \le 4\) | 0.2902 | 145.10 | 188 | 12.6837 |
| \(4<X \le 5\) | 0.1149 | 57.45 | 28 | 15.0966 |
| \(5<X\) | 0.0228 | 11.40 | 8 | 1.0140 |
| | | | | \(\chi^2=35.83\) |
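The cell probabilities can also be obtained from the normal CDF without a z-table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are chosen for this illustration. Because the hand calculation rounds z-scores to two decimals, the result is expected to be close to, but not identical to, the tabulated statistic.

```python
from statistics import NormalDist

# Distribution specified under the null hypothesis.
dist = NormalDist(mu=2.8, sigma=1.1)

edges = [1, 2, 3, 4, 5]                  # interior cell boundaries, in years
observed = [12, 94, 170, 188, 28, 8]     # f_i for the six cells
n = sum(observed)

# CDF at each boundary, padded with 0 and 1 for the two open-ended cells.
cdf = [0.0] + [dist.cdf(x) for x in edges] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]   # p_i for each cell

expected = [n * p for p in probs]        # e_i = n * p_i
stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

print([round(p, 4) for p in probs])
print(round(stat, 2))
```

Padding the CDF list with 0 and 1 makes the six probabilities sum to 1, which the rounded table values do only approximately.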

Critical value

At \(\alpha=0.05\) with \(df=6-1=5\), \(\chi^2_{\alpha,5}=11.07\).

Decision

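Since \(\chi^2 = 35.83\) exceeds the critical value \(\chi^2_{\alpha,5}\), we reject the null hypothesis: the data do not support a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\) for battery life.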

14.1.3 Uniform distribution (continuous)

The following example illustrates how to test whether a variable follows a uniform distribution between \(a\) and \(b\).

Example. Suppose \(X\), the amount of time a person stays in bed after their alarm goes off, is uniformly distributed between 5 and \(a\) minutes. Also suppose \(Y\), the number of minutes they are late to work, is uniformly distributed between 0 and \(b\) minutes. Over 100 days, both how long they slept past their alarm (\(X\)) and how late they were to work (\(Y\)) were recorded in minutes. However, only the number of days on which \(X\) and \(Y\) fell within certain ranges was reported, as shown in the table below:

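| | \(5<X<7\) | \(7<X<8\) | \(8<X<10\) | Totals |
|---|---|---|---|---|
| \(0<Y<2\) | 18 | 9 | 12 | |
| \(2<Y<3\) | 9 | 4 | 12 | |
| \(3<Y<5\) | 13 | 9 | 14 | |
| Totals | | | | 100 |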

(a) Based on the data given above, test whether \(a\) is equal to 10. Clearly state your hypotheses and use a significance level of \(\alpha = 5\%\).

Solution:

If \(a=10\), then \(X\), the amount of time a person stays in bed after their alarm goes off, is uniformly distributed between 5 and 10 minutes, and the observed frequencies of \(X\) should fit that distribution. The following hypotheses can therefore be formed:

Hypotheses

H0: The amount of time a person stays in bed after their alarm goes off is uniformly distributed between 5 and \(a=10\) minutes.

Ha: The amount of time a person stays in bed after their alarm goes off is not uniformly distributed between 5 and \(a=10\) minutes.

Test statistic calculation

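If \(X\sim U(5,10)\), then the PDF is \[f(x)=\frac{1}{10-5}=\frac{1}{5};\ \ 5<x<10\]

So \(P(5<X<7)=(7-5)\times\frac{1}{5}=\frac{2}{5}\), and similarly for the other intervals. The observed frequencies \(f_i\) are the column totals of the data table, for example \(18+9+13=40\) for \(5<X<7\).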

| Time in bed (minutes) | \(f_i\) | \(p_i\) | \(e_i=np_i\) | \(\frac{(f_i-e_i)^2}{e_i}\) |
|---|---|---|---|---|
| \(5<X<7\) | 40 | \(\frac{2}{5}\) | 40 | 0.00 |
| \(7<X<8\) | 22 | \(\frac{1}{5}\) | 20 | 0.20 |
| \(8<X<10\) | 38 | \(\frac{2}{5}\) | 40 | 0.10 |
| Totals | 100 | | | \(\chi^2=0.3\) |
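The same calculation can be sketched in a few lines of Python. For a uniform distribution on \((a, b)\), the probability of a cell is its width divided by \(b-a\). The helper name `uniform_cell_probs` is hypothetical and is introduced only for this illustration.

```python
def uniform_cell_probs(cells, a, b):
    """P(lo < X < hi) for X ~ U(a, b), for each (lo, hi) cell inside (a, b)."""
    return [(hi - lo) / (b - a) for lo, hi in cells]


cells = [(5, 7), (7, 8), (8, 10)]

# Observed frequencies of X are the column totals of the data table.
observed = [18 + 9 + 13, 9 + 4 + 9, 12 + 12 + 14]   # 40, 22, 38
n = sum(observed)

probs = uniform_cell_probs(cells, a=5, b=10)        # 2/5, 1/5, 2/5 under H0
expected = [n * p for p in probs]                   # e_i = n * p_i
stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

print(expected, round(stat, 2))
```

The sketch assumes that the cells partition the interval \((a, b)\), so that the probabilities sum to 1.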

Critical value

At \(\alpha=5\%\) and \(df=3-1=2\), \(\chi^2_{\alpha,2}=5.99\).

Decision

Since \(\chi^2 = 0.3 < \chi^2_{\alpha,2} = 5.99\), we cannot reject the null hypothesis; the data are consistent with \(a=10\).

14.2 Test for Independence (Categorical Data)

Consider the following example:

In an experiment to study the dependence of hypertension on smoking habits, the following data were taken on 180 individuals:

| | Non-smokers | Moderate Smokers | Heavy Smokers |
|---|---|---|---|
| Hypertension | 21 | 36 | 30 |
| No hypertension | 48 | 26 | 19 |

Test the hypothesis that the presence or absence of hypertension is independent of smoking habits. Use a 0.05 level of significance.

Solution:

We have to test the following hypotheses:

\(H_0: \text{The column variable (smoking habits) is independent of the row variable (hypertension)}\)

\(H_a: \text{The column variable (smoking habits) is not independent of the row variable (hypertension)}\)

Test statistic

\[ \chi^2=\sum_{i=1}^r\sum_{j=1}^c \frac {(f_{ij}-e_{ij})^2}{e_{ij}} \]

The sampling distribution of the test statistic is approximately chi-squared with \((r-1)\times (c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns of the contingency table, provided that the sample size is large.

Note

\(f_{ij}=\) observed frequency of the \((i,j)^{th}\) cell;

\(e_{ij}=\) expected frequency of the \((i,j)^{th}\) cell \(=\frac{\text{Row } i \text{ total} \times \text{Column } j \text{ total}}{\text{Sample size } (n)}\)

Table: Contingency table with row and column totals

| | Non-smokers | Moderate Smokers | Heavy Smokers | Row Total |
|---|---|---|---|---|
| Hypertension | 21 | 36 | 30 | 87 |
| No hypertension | 48 | 26 | 19 | 93 |
| Column Total | 69 | 62 | 49 | 180 |

For example,

\[ e_{11}=\frac{87\times 69}{180}=33.35 \]

Chi-square Statistic calculation

| Observed, \(f_{ij}\) | Expected, \(e_{ij}\) | \(\frac{(f_{ij}-e_{ij})^2}{e_{ij}}\) |
|---|---|---|
| 21 | 33.35 | 4.57 |
| 36 | 29.97 | 1.21 |
| 30 | 23.68 | 1.68 |
| 48 | 35.65 | 4.28 |
| 26 | 32.03 | 1.14 |
| 19 | 25.32 | 1.58 |
| | | \(\chi^2=14.46\) |
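The expected frequencies and the statistic for a contingency table follow mechanically from the row and column totals. The following is a minimal sketch in plain Python for the hypertension data; the variable names are chosen for this illustration.

```python
observed = [
    [21, 36, 30],   # hypertension: non-, moderate, heavy smokers
    [48, 26, 19],   # no hypertension
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (row i total * column j total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum the contributions (f_ij - e_ij)^2 / e_ij over all cells.
stat = sum(
    (f - e) ** 2 / e
    for f_row, e_row in zip(observed, expected)
    for f, e in zip(f_row, e_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(round(stat, 2), df)
```

Working from unrounded expected frequencies avoids the small rounding differences that can accumulate in a hand calculation.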

Critical value

At \(\alpha = 0.05\) with \(df=(2-1)\times(3-1)=2\), \(\chi^2_\alpha = 5.99\).

Decision

Since \(\chi^2 = 14.46\) exceeds the critical value \(\chi^2_\alpha\), we reject the null hypothesis.

Interpretation/conclusion

There is sufficient evidence at the \(5\%\) significance level to infer that smoking habits are not independent of the presence or absence of hypertension; the two variables are associated.