14 Chi-squared Test

14.1 Goodness of Fit Test

In this section we use a chi-squared test to determine whether a population being sampled has a specific probability distribution.

14.1.1 A Multinomial Population

Multinomial Experiment

A multinomial experiment is one that possesses the following properties.

The experiment consists of a fixed number \(n\) of trials.
The outcome of each trial can be classified into one of \(k\) categories, called cells.
The probability \(p_i\) that the outcome will fall into cell \(i\) remains constant for each trial. Moreover, \(p_1 + p_2 + \cdots + p_k = 1\).
Each trial of the experiment is independent of the other trials.
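The four properties above can be illustrated with a small simulation. This is only a sketch: the function name `simulate_multinomial`, the cell probabilities, and the seed are hypothetical choices for illustration and are not part of the examples in this chapter.

```python
import random
from collections import Counter

def simulate_multinomial(n, probs, seed=0):
    """Run n independent trials; each trial lands in cell i with probability probs[i]."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)              # seeded so the run is repeatable
    cells = list(range(len(probs)))
    # Each draw is independent and uses the same fixed probabilities.
    outcomes = rng.choices(cells, weights=probs, k=n)
    tally = Counter(outcomes)
    return [tally[i] for i in cells]       # observed frequency of each cell

# Hypothetical experiment: n = 200 trials, k = 3 cells.
counts = simulate_multinomial(n=200, probs=[0.5, 0.3, 0.2])
print(counts, sum(counts))
```

Whatever the individual counts turn out to be, they always add up to the fixed number of trials \(n\).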

Testing Market Shares

Company A has recently conducted aggressive advertising campaigns to maintain and possibly increase its share of the market (currently \(45\%\)) for fabric softener. Its main competitor, company B, has \(40\%\) of the market, and a number of other competitors account for the remaining \(15\%\).

To determine whether the market shares changed after the advertising campaign, the marketing manager for company A solicited the preferences of a random sample of 200 customers of fabric softener.

Of the 200 customers, 102 indicated a preference for company A’s product, 82 preferred company B’s fabric softener, and the remaining 16 preferred the products of one of the competitors. Can the analyst infer at the \(5\%\) significance level that customer preferences have changed from their levels before the advertising campaigns were launched?

We recognize this experiment as a multinomial experiment, and we identify the technique as the chi-squared goodness-of-fit test. Because we want to know whether the market shares have changed, we specify those precampaign market shares in the null hypothesis.

\[ H_0: p_1=0.45;\ \ p_2=0.40; \ \ p_3=0.15 \]

The alternative hypothesis attempts to answer our question: have the proportions changed? Thus,

\[ H_1: \text{At least one } p_i \text{ is not equal to its specified value} \]

Chi-Squared Goodness-of-Fit Test Statistic

\[ \chi^2 =\sum_{i=1}^k \frac{(f_i-e_i)^2}{e_i} \] where \(f_i\) is the observed frequency and \(e_i\) is the expected frequency of cell \(i\).

Note that \(e_i = n p_i\).

The sampling distribution of the test statistic is approximately chi-squared with \(k-1\) degrees of freedom, provided that the sample size is large. A common rule of thumb is that every expected frequency \(e_i\) should be at least 5.

Test Statistic calculation

Company	Observed frequency, \(f_i\)	Expected frequency, \(e_i\)	\((f_i-e_i)\)	\(\frac{(f_i-e_i)^2}{e_i}\)
A	102	90	12	1.60
B	82	80	2	0.05
Other	16	30	-14	6.53
Total	200	200		\(\chi^2=8.18\)
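The arithmetic in the table can be checked with a few lines of Python. This is a minimal sketch; the helper name `chi_squared_statistic` is an illustrative choice, and the observed counts and null proportions are the ones given in the example.

```python
def chi_squared_statistic(observed, probs):
    """Return expected frequencies, per-cell contributions, and their sum."""
    n = sum(observed)
    expected = [n * p for p in probs]                      # e_i = n * p_i
    terms = [(f - e) ** 2 / e for f, e in zip(observed, expected)]
    return expected, terms, sum(terms)

observed = [102, 82, 16]            # companies A, B, Other
null_probs = [0.45, 0.40, 0.15]     # market shares under H0

expected, terms, chi2 = chi_squared_statistic(observed, null_probs)
for name, f, e, t in zip(["A", "B", "Other"], observed, expected, terms):
    print(f"{name:5s} {f:4d} {e:6.1f} {t:5.2f}")
print(f"chi-squared = {chi2:.2f}")
```

Most of the statistic comes from the "Other" cell, where the observed count falls well short of its expected value.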

Critical value

At \(\alpha =0.05\) and for \(df=3-1=2\), \(\chi^2_\alpha=5.99\).

Decision

Since \(\chi^2 > \chi^2_{\alpha}\), we reject the null hypothesis.
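For the special case of 2 degrees of freedom, the chi-squared distribution coincides with an exponential distribution with mean 2, so the upper-tail probability is \(P(\chi^2 > x) = e^{-x/2}\). The sketch below uses that fact to recover the critical value and a p-value without statistical tables. The function names are illustrative, and the shortcut does not apply to other degrees of freedom.

```python
import math

def chi2_upper_tail_df2(x):
    """P(chi-squared > x) for exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def chi2_critical_df2(alpha):
    """Value c with P(chi-squared > c) = alpha, for exactly 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

chi2 = 8.18      # test statistic from the market share table
alpha = 0.05

critical = chi2_critical_df2(alpha)
p_value = chi2_upper_tail_df2(chi2)
reject = chi2 > critical

print(f"critical value = {critical:.2f}")
print(f"p-value        = {p_value:.4f}")
print("reject H0" if reject else "do not reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with \(\alpha\) always lead to the same decision.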

Interpretation/ Conclusion

There is sufficient evidence at the \(5\%\) significance level to infer that the proportions have changed since the advertising campaigns were implemented.

14.1.2 Normal population (continuous)

The following example illustrates how to test whether a variable follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).

Example: A random sample of 500 car batteries was taken and the life of each battery was measured. Letting X denote battery life in years, suppose that the sample revealed the following distribution of battery life:

Life (in years)	Frequency
\(X<1\)	12
\(1<X \le 2\)	94
\(2<X \le 3\)	170
\(3 <X \le 4\)	188
\(4<X \le 5\)	28
\(5<X\)	8
	500

Based on this data, test whether battery life follows a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\). Clearly state your hypotheses and use a significance level of \(\alpha = 5\%\).

Solution:

Hypotheses

\(H_0\): The battery life follows a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\).

\(H_a\): The battery life does not follow a normal distribution with \(\mu = 2.8\) and \(\sigma^2 = 1.1^2\).

Test Statistic calculation

Life (in years)	Probability	\(e_i=np_i\)	\(f_i\)	\(\frac{(f_i-e_i)^2}{e_i}\)
\(X<1\)	\(P(X<1)\) \(=P(Z<-1.64)\) \(=0.0505\)	25.25	12	6.9530
\(1<X \le 2\)	\(P(1<X \le 2)\) \(=P(-1.64<Z\le -0.73)\) \(=0.1826\)	91.30	94	0.0798
\(2<X \le 3\)	0.3386	169.30	170	0.0029
\(3 <X \le 4\)	0.2902	145.10	188	12.6837
\(4<X \le 5\)	0.1149	57.45	28	15.0966
\(5<X\)	0.0228	11.40	8	1.0140
				\(\chi^2=35.83\)
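The cell probabilities and the test statistic can be recomputed with the standard library's `statistics.NormalDist`. This is a sketch under the stated null hypothesis (\(\mu = 2.8\), \(\sigma = 1.1\)); the helper name `normal_cell_probabilities` is an illustrative choice. Because the code uses unrounded z-scores, its results may differ slightly from the table, which rounds z to two decimals.

```python
from statistics import NormalDist

def normal_cell_probabilities(cut_points, mu, sigma):
    """Probabilities of (-inf, c1], (c1, c2], ..., (ck, +inf) under N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    cdf = [0.0] + [dist.cdf(c) for c in cut_points] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

observed = [12, 94, 170, 188, 28, 8]            # frequencies from the data table
probs = normal_cell_probabilities([1, 2, 3, 4, 5], mu=2.8, sigma=1.1)

n = sum(observed)
expected = [n * p for p in probs]                # e_i = n * p_i
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

for p, e, f in zip(probs, expected, observed):
    print(f"p = {p:.4f}  e = {e:7.2f}  f = {f:3d}")
print(f"chi-squared = {chi2:.2f}")
```

The two cells covering 3 to 5 years contribute most of the statistic, which is where the observed counts depart furthest from the normal model.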

Critical value

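At \(\alpha=0.05\) and \(df=6-1=5\), \(\chi^2_{\alpha,5}=11.07\). No further degrees of freedom are lost because \(\mu\) and \(\sigma\) are specified in \(H_0\) rather than estimated from the sample.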

Decision

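Since \(\chi^2>\chi^2_{\alpha,5}\), we reject the null hypothesis. There is sufficient evidence at the \(5\%\) significance level to conclude that battery life does not follow a normal distribution with \(\mu = 2.8\) and \(\sigma = 1.1\).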

14.1.3 Uniform distribution (continuous)

The following example illustrates how to test whether a variable follows a uniform distribution between \(a\) and \(b\).

Example: Suppose X, the amount of time a person stays in bed after their alarm goes off, is uniformly distributed between 5 and a minutes. Also, suppose Y, the number of minutes they are late to work, is uniformly distributed between 0 and b minutes. Over 100 days, how long they slept past their alarm (X) and how late they were to work (Y) were recorded (in minutes). However, only the number of days for which X and Y fell within certain ranges was reported in the table below:

	5<X<7	7<X<8	8<X<10	Totals
0<Y<2	18	9	12
2<Y<3	9	4	12
3<Y<5	13	9	14
Totals				100

(a) Based on the data given above, test whether \(a\) is equal to 10. Clearly state your hypotheses and use a significance level of \(\alpha = 5\%\).

Solution:

If \(a=10\), then X, the amount of time a person stays in bed after their alarm goes off, is uniformly distributed between 5 and 10 minutes, and the observed frequencies of X should fit that distribution. The following hypotheses can therefore be formed:

Hypotheses

\(H_0\): The amount of time a person stays in bed after their alarm goes off is uniformly distributed between 5 and \(a=10\) minutes.

\(H_a\): The amount of time a person stays in bed after their alarm goes off is NOT uniformly distributed between 5 and \(a=10\) minutes.

Test statistic calculation

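If \(X\sim U(5,10)\), then its PDF is \[f(x)=\frac{1}{10-5}=\frac{1}{5};\ \ 5<x<10\]

So \(P(5<X<7)=(7-5)\times\frac{1}{5}=\frac{2}{5}\), and similarly for the other intervals. The observed frequencies \(f_i\) are the column totals of the data table: \(18+9+13=40\), \(9+4+9=22\), and \(12+12+14=38\).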

Bed time (in mins)	\(f_i\)	\(p_i\)	\(e_i=np_i\)	\(\frac{(f_i-e_i)^2}{e_i}\)
5<X<7	40	\(\frac{2}{5}\)	40	0.00
7<X<8	22	\(\frac{1}{5}\)	20	0.20
8<X<10	38	\(\frac{2}{5}\)	40	0.10
Totals	100			\(\chi^2=0.3\)
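The same calculation can be sketched in Python, starting from the two-way table of counts. The variable and function names are illustrative. The observed frequencies for X are obtained by summing each column of the table, and the interval probabilities assume \(X\sim U(5,10)\) as stated in the null hypothesis.

```python
# Rows are the Y ranges, columns are the X ranges, as in the data table.
table = [
    [18, 9, 12],   # 0 < Y < 2
    [9, 4, 12],    # 2 < Y < 3
    [13, 9, 14],   # 3 < Y < 5
]
x_intervals = [(5, 7), (7, 8), (8, 10)]
a_low, a_high = 5, 10              # bounds of the uniform distribution under H0

def uniform_interval_prob(lo, hi, low, high):
    """P(lo < X < hi) for X ~ U(low, high), assuming low <= lo < hi <= high."""
    return (hi - lo) / (high - low)

observed = [sum(col) for col in zip(*table)]       # column totals
n = sum(observed)
probs = [uniform_interval_prob(lo, hi, a_low, a_high) for lo, hi in x_intervals]
expected = [n * p for p in probs]
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

print(observed, expected)
print(f"chi-squared = {chi2:.2f}")
```

Only the X margins enter this test; the Y ranges matter only for testing the second parameter \(b\).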

Critical value

At \(\alpha=5\%\) and \(df=3-1=2\), \(\chi^2_{\alpha,2}=5.99\).

Decision

Since \(\chi^2<\chi^2_{\alpha,2}\), we cannot reject the null hypothesis. The data are consistent with \(a=10\).

14.2 Test for Independence (Categorical Data)

Consider the following example:

In an experiment to study the dependence of hypertension on smoking habits, the following data were taken on 180 individuals:

	Non-smokers	Moderate Smokers	Heavy Smokers
Hypertension	21	36	30
No hypertension	48	26	19

Test the hypothesis that the presence or absence of hypertension is independent of smoking habits. Use a 0.05 level of significance.

Solution:

We have to test the following hypothesis:

\(H_0: \text{The column variable is independent of the row variable}\)

\(H_a: \text{The column variable is not independent of the row variable}\)

Test statistic

\[ \chi^2=\sum_{i=1}^r\sum_{j=1}^c \frac {(f_{ij}-e_{ij})^2}{e_{ij}} \]

The sampling distribution of the test statistic is approximately chi-squared distributed with \((r-1)\times (c-1)\) degrees of freedom, provided that the sample size is large.

Note

\(f_{ij}=\) Observed frequency of \((i,j)^{th}\) cell;

\(e_{ij}=\) Expected frequency of \((i,j)^{th}\) cell \(=\frac{\text{Row } i \text{ total} \times \text{Column } j \text{ total}}{\text{Sample size } (n)}\)

Table: Contingency table with Row total and Column total

	Non-smokers	Moderate Smokers	Heavy Smokers	Row Total
Hypertension	21	36	30	87
No hypertension	48	26	19	93
Column Total	69	62	49	180

For example,

\[ \boldsymbol {e_{11}=\frac{87\times69}{180}} \]
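All six expected frequencies follow from the same rule. The sketch below applies it to the whole contingency table; the function name `expected_frequencies` is an illustrative choice.

```python
observed = [
    [21, 36, 30],   # hypertension
    [48, 26, 19],   # no hypertension
]

def expected_frequencies(table):
    """e_ij = (row i total * column j total) / n for every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print("  ".join(f"{e:6.2f}" for e in row))
```

Each row and column of the expected table has the same total as the corresponding row or column of the observed table.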

Chi-square Statistic calculation

Observed, \(f_{ij}\)	Expected, \(e_{ij}\)	\(\frac{(f_{ij}-e_{ij})^2}{e_{ij}}\)
21	33.35	4.57
36	29.97	1.21
30	23.68	1.68
48	35.65	4.28
26	32.03	1.14
19	25.32	1.58
		\(\chi^2=14.46\)

Critical value

At \(\alpha =0.05\) and with \(df=(2-1)\times(3-1)=2\), \(\chi^2_\alpha =5.99\).

Decision

Since \(\chi^2 > \chi^2_\alpha\), we reject the null hypothesis.
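The whole test can be sketched end to end in Python. The function name `chi2_independence` is illustrative. The p-value line relies on the fact that, for exactly 2 degrees of freedom, \(P(\chi^2 > x) = e^{-x/2}\); it is not valid for tables of other sizes. The significance level is the 0.05 requested in the problem statement.

```python
import math

def chi2_independence(table):
    """Return the chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n      # expected frequency e_ij
            chi2 += (f - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

observed = [
    [21, 36, 30],   # hypertension
    [48, 26, 19],   # no hypertension
]
alpha = 0.05

chi2, df = chi2_independence(observed)
assert df == 2, "the closed-form p-value below only holds for 2 degrees of freedom"
p_value = math.exp(-chi2 / 2.0)

print(f"chi-squared = {chi2:.2f}, df = {df}, p-value = {p_value:.5f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

The p-value is far below 0.05, so the decision does not depend on small rounding differences in the hand calculation.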

Interpretation/conclusion

There is sufficient evidence at the \(5\%\) significance level to infer that the presence or absence of hypertension is not independent of smoking habits; the two variables are associated.