9 Sampling and Sampling distributions

9.1 Some preliminary idea (Anderson 2020)

An element is the entry on which data are collected.
A population is the collection of all the elements of interest.
A sample is a subset of the population.
A sampling frame is the list of all the elements in the population of interest.

9.2 Sampling from a Finite Population

9.2.1 Simple random sample (Finite population)

A simple random sample (SRS) of size $n$ from a finite population of size $N$ is a sample selected such that each possible sample of size $n$ has the same probability of being selected.

Sampling can be with replacement (SWR).
Sampling can be without replacement (SWOR)(recommended).

9.3 Sampling from an Infinite Population

In the case of infinite population, it is not possible to develop a sampling from. In that case statisticians recommend selecting a random sample

9.3.1 Random sample (Infinite population)

A random sample of size $n$ from an infinite population is a sample selected such that the following conditions are satisfied.

1. Each element selected comes from the same population.

2. Each element is selected independently.

Notes and Comments

1) A sample selected randomly from a population (finite or infinite) is referred as a random sample. The procedure of selecting a sample randomly is known as probability sampling.

2) The number of ways simple random samples by SWR of size $n$ that can be selected from a finite population of size $N$ is

\[ N^n \]

3) The number of ways different simple random samples by SWOR of size $n$ that can be selected from a finite population of size $N$ is

\[ {N \choose{n}}=\frac{N!}{n!(N-n)!} \]

4) Some other probability sampling methods are stratified random sampling, cluster sampling, and systematic sampling. We will discuss these methods later.

5) We use the term “simple” in simple random sampling to clarify that this is the probability sampling method that assures each sample of size $n$ has the same probability of being selected.

9.3.2 Selecting simple random sample using R

Suppose we have a variable $X$ and let it contains $N=5$ elements as follows:

$X=\{1,3,5,7,9\}$

Using R we can draw several simple random samples of size $n=3$.

Now we draw a random sample (by using both SWR and SWOR )using sample() function in base R.

Code

set.seed(2103) # To keep reproducibility
X=c(1,3,5,7,9) # The elemnets in X variable

# Drawing a random sample without replacement
sample(X,3, replace = FALSE)

[1] 3 7 5

Code

# Drawing a random sample with replacement
sample(X,3, replace = TRUE)

[1] 1 9 9

The all possible that is ${N\choose n}={5\choose 3}=10$ samples (by SWOR) are :

         [,1] [,2] [,3]
Sample1     1    3    5
Sample2     1    3    7
Sample3     1    3    9
Sample4     1    5    7
Sample5     1    5    9
Sample6     1    7    9
Sample7     3    5    7
Sample8     3    5    9
Sample9     3    7    9
Sample10    5    7    9

9.4 Sampling distribution

The probability distribution of a sample statistic is called a sampling distribution.

For example, due to sampling variability the sample mean $\bar X$ has a sampling distribution.

Illustration Consider a population of variable X: 1,3,5,7,9.

Task-1: Compute population mean $\mu$ .

Solution:

Here, $\mu=\frac{\sum x}{N}=\frac{1+3+...+9}{5}=5$

Task-2: Draw all possible samples (without replacement) of size $n=2$ from this population. Then compute the means of all samples.

Solution:

Table 9.1: All Samples of Size 2 and Their Means

Sample	Sample mean, $\bar x$
1,3	2
1,5	3
1,7	4
1,9	5
3,5	4
3,7	5
3,9	6
5,7	6
5,9	7
7,9	8

Task-3: Construct a probability distribution of sample means, $\bar X$ (discrete) and plot it.

Solution:

Table 9.2: Sampling distribution of $\bar x$

$\bar x$	$f(\bar x)$
2	$\frac{1}{10}$
3	$\frac{1}{10}$
4	$\frac{2}{10}$
5	$\frac{2}{10}$
6	$\frac{2}{10}$
7	$\frac{1}{10}$
8	$\frac{1}{10}$

Figure 9.1: Sampling distribution of $\bar{x}$

Task-4: Find the $E(\bar x)$ . Does $E(\bar x)$ same as population mean $\mu$?

Solution:

$E(\bar x)=\sum \bar x \cdot f(\bar x)$

$=2(1/10)+3(1/10)+4(2/10)+\cdot \cdot \cdot+8(1/10)=5$

We can see that $E(\bar x) =5$ is same as $\mu=5$.

NOTE: This phenomenon is known as the unbiasedness of a sample statistic or an estimator. We will discuss it briefly in next chapter.

Home work: Consider a population of variable X: 3,6,9,12,15.

i) Compute population mean $\mu$ .

ii) Draw all possible samples of size $n=3$ from this population without replacement. Then compute the means of all samples.

iii) Construct a probability distribution of sample mean, $\bar x$ (discrete) and plot it.

iv) Find the $E(\bar X)$ . Does $E(\bar X)$ same as population mean $\mu$?

9.5 Sampling distribution of $\bar X$

The sampling distribution of $\bar X$ is the probability distribution of all possible values of the sample mean $\bar X$.

Expected value of $\bar X$:

\[ E(\bar X)=\mu_{\bar x}=\mu \]

If samples are drawn from a infinite population (or finite but $n/N\le 0.05$ (Anderson 2020) or in the case of SWR then the

Standard deviation ox $\bar X$:

\[ \sigma_{\bar x}=\frac{\sigma}{\sqrt n} \]

Standard deviation of $\bar x$ is also known as Standard error of $\bar x$ or $s.e(\bar x)$.

But what is the form of the sampling distribution of $\bar x$?

9.5.1 Central limit theorem (CLT)

The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of $\bar x$ will resemble a normal distribution.

The Central Limit Theorem

Let, $X_1$,$X_2$, …. be a sequence of independently and identically distributed random variables with common mean $\mu$ and common variance $\sigma^2$. We define

\[ Z=\frac{\bar x -\mu_{\bar X}}{\sigma_{\bar X}} \]

Then the $Z$ will be approximately normally distributed as the sample size $n \rightarrow\infty$.

The definition of “sufficiently large” depends on the extent of non-normality of $X$ . Some authors consider a sample will be sufficiently large if $n\ge30$ (Walpole et al. 2017).

9.5.2 Central Limit Theorem through simulation

In this section we illustrates how sampling distributions of sample means approximate to normal or bell shaped distribution as we increase the sample size .

At first, we consider a population data regarding gdp per capita (USD),2023 of 218 countries. We can see that the distribution of gdp per capita is highly skewed to the right (see Figure 9.2).

Figure 9.2: Frequency histogram of GDP percapita of N=218 countries

Now we draw 1000 random samples (without replacement) of different sample sizes and then plot the histogram of samples means.

(a) Sampling distribution of sample mean for sample size n=10

From Figure 9.3 we can see that as the sample size increases, the sampling distribution of sample mean tends to bell-shaped or normal though the population data was very skewed to the right. This simulation clearly demonstrate the fact of Central Limit Theorem (CLT).

For more interactive simulation of CLT please click here to visit the ShinyApp for Central Limit Theorem Simulation.

Problem 8.1 The foreman of a bottling plant has observed that the amount of soda in each 32-ounce bottle is actually a normally distributed random variable, with a mean of 32.2 ounces and a standard deviation of .3 ounce (Keller 2014, 308).

a. If a customer buys one bottle, what is the probability that the bottle will contain more than 32 ounces?

b. If a customer buys a carton of four bottles, what is the probability that the mean amount of the four bottles will be greater than 32 ounces?

Problem 8.2 Suppose a subdivision on the southwest side of Denver, Colorado, contains 2215 houses. The subdivision was built in 1983. A sample of 100 houses is selected randomly and evaluated by an appraiser. If the mean appraised value of a house in this subdivision for all houses is $177,000, with a standard deviation of $8,500, what is the probability that the sample average is greater than $185,000? (Black 2012, 243 (population size is changed))

Problem 8.3 A scientist is studying the heights of men in Australia. The true population mean $\mu$ is unknown but the true population standard deviation is assumed to be 2.5 inches. Suppose the scientist randomly samples 100 men. Find the probability that the difference between the sample mean and the true population mean is less than 0.5 inches.

Problem 8.4 In winter, it tends to rain a lot in Canberra. Suppose that the amount of rain that falls on any given winter day in Canberra is normally distributed with a mean of $2.3 mm$ and a variance of $1.1 \ \ mm^2$.

(a) Find the probability that between 1.9 and 3.4 mm of rain fell today.

(b) Find the probability that the total amount of rain that falls over the next 20 days is between 54.3 and 57.1 mm.

Problem 8.5 Suppose, we load on a plane 100 packages whose weights are independent random variables that are uniformly distributed between 5 and 50 pounds. What is the probability that the total weight will exceed 3000 pounds?

9.6 Sampling distribution of sample proportion, $\hat p$

The sample proportion $\hat p$ is the point estimator of the population proportion $p$. The formula for computing the sample proportion is

\[ \hat p=\frac {x}{n} \] Where,

$x=$ number of successes in the sample of size $n$.

Case-I (small sample): If $X\sim Bin(n,p)$ then $\hat p$ also follows binomial distribution with

\[ Mean: E(\hat p)=p \]

\[ Variance: \sigma^2_{\hat p}=\frac{p(1-p)}{n} \]

Case-II (large sample): When the sample size is large enough so that $np$ and $n(1-p)$ are greater than or equal to $5$ then $\hat p$ will be approximately normally distributed with

\[ Mean: E(\hat p)=p \]

\[ Variance: \sigma^2_{\hat p}=\frac{p(1-p)}{n} \]

Problem 8.6 According to the Internal Revenue Service, 75% of all tax returns lead to a refund. A random sample of 100 tax returns is taken.

What is the mean of the distribution of the sample proportion of returns leading to refunds?
What is the variance of the sample proportion?
What is the standard error of the sample proportion?
What is the probability that the sample proportion exceeds 0.8?

Problem 8.7 A random sample of 270 homes was taken from a large population of older homes to estimate the proportion of homes with unsafe wiring. If, in fact, 20% of the homes have unsafe wiring, what is the probability that the sample proportion will be between 16% and 24%?

9.7 Sampling Distribution of the Sample Variances

Like as sample mean, sample variance is also considered as a random variable due to sampling variability. If we a random sample $\{x_1,x_2, . . . , x_n\}$ is a random sample of size $n$ then the quantity

\[ s^2=\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar x)^2 \]

is called the sample variance.

Sampling Distribution of the Sample Variances

If $s^2$ is the variance of a random sample of size $n$ taken from a normal population having the variance $\sigma^2$, then the statistic

\[ \chi^2=\frac{(n-1)s^2}{\sigma^2}=\frac{\sum_{i=1}^n (x_i-\bar x)^2}{\sigma^2} \]

has a Chi-squared distribution with $\nu =n-1$ degrees of freedom.

Mean of $s^2$: $E(s^2)=\sigma^2$

Variance of $s^2$: $Var(s^2)=\frac{2\sigma^4}{(n-1)}$

How to to determine the area under the curve of $\chi^2$ distribution?

In every statistics textbook area under the $\chi^2$ distribution can be determined for a given degrees of freedom. The distribution is defined for only positive values, since variances are all positive values. For a given probability or area say $\alpha$ and degrees of freedom $\nu$ we can determine the value of $\chi^2$ to the upper tail such that:

\[ P(\chi^2>\chi^2_{\alpha})=\alpha \]

Figure 9.4: The chi-squared distribution

For example, when $\alpha=0.05$ and $\nu=10$ the value of $\chi^2_{\alpha}$ is $18.307$.

Figure 9.5: Chi-square distribution. Source: Appendix B, TABLE 3 (Anderson 2020)

Problem 8.8 A random sample of size $n = 18$ is obtained from a normally distributed population with a population mean of $\mu = 46$ and a variance of $\sigma^2 = 50$.

What is the probability that the sample mean is greater than 50?
What is the value of the sample variance such that 5% of the sample variances would be less than this value?
What is the value of the sample variance such that 5% of the sample variances would be greater than this value?

Problem 8.9 A process produces batches of a chemical whose impurity concentrations follow a normal distribution with a variance of 1.75. A random sample of 20 of these batches is chosen. Find the probability that the sample variance exceeds 3.10.

Solution:

Let, $X$ be the impurity concentration

Given, $\sigma^2=1.75$; $n=20$ . We have to compute

\[ P[s^2>3.10]=P\left[ \frac{(n-1)s^2}{\sigma^2}>\frac{(20-1)(3.10)}{1.75} \right ] \]

\[ =P\left[ \chi^2>33.657 \right]\approx 0.01 \]

So there is approximately 1% chance that the sample variance exceeds 3.10.

9.8 t-Distribution

Let $Z\sim N(0,1)$ and $V\sim \chi^2 _\nu$ . If $Z$ and $V$ are independent then the random variable

\[ T=\frac{Z}{\sqrt {V/\nu}} \]

said to have a Student-t distribution with $\nu$ degrees of freedom. The PDF of $T$ is

\[ f(t)=\frac{\Gamma [(\nu+1)/2]}{\sqrt{ \pi\nu} \ \ \Gamma{(\nu/2)}}\left(1+\frac{t^2}{\nu}\right)^{-(\nu+1)/2} ; -\infty<t<\infty. \]

Theorem

Given a random sample of $n$ observations, with sample mean $\bar x$ and sample standard deviation $s$, from a normally distributed population with mean $\mu$, the random variable $t$ follows the Student’s t distribution with $\nu=(n - 1)$ degrees of freedom and is given by

\[ t=\frac{\bar x-\mu}{s/\sqrt n} \]

Properties:

Symmetry: $t$-distribution is symmetric about mean (zero). So

if $P(T>t_\nu)=\alpha$ then $P(T<-t_\nu)=\alpha$.
Convergence to Normal: As $n\rightarrow\infty$ then the distribution of $T_\nu$ approaches the standard Normal distribution.
Cauchy as special case: The $T_1$ distribution is the same as the Cauchy distribution.

How to to determine the area under the curve of $t$- distribution?

From $t$ -distribution table we can determine the value of $t$ for a given $area$ and degrees of freedom.

For example, with $n=11$ and area in upper tail 0.05 the the value of $t$ is $1.812$. That is

\[ P(T>1.812)=0.05 \]

Due to symmetry

\[ P(T<-1.812)=0.05 \]

t-distribution. Source: Appendix B, TABLE 3 (Anderson 2020)

9.9 F-Distribution

9.9.1 The F -Distribution with Two Sample Variances

If $s_1^2$ and $s_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$ taken from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, respectively, then the random variable

\[ F=\frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \]

has an $F$ distribution with numerator degrees of freedom $(n_1-1)$ and denominator degrees of freedom $(n_2-1)$.

\(\bar x\)	\(f(\bar x)\)
2	\(\frac{1}{10}\)
3	\(\frac{1}{10}\)
4	\(\frac{2}{10}\)
5	\(\frac{2}{10}\)
6	\(\frac{2}{10}\)
7	\(\frac{1}{10}\)
8	\(\frac{1}{10}\)

9 Sampling and Sampling distributions

9.1 Some preliminary idea (Anderson 2020)

9.2 Sampling from a Finite Population

9.2.1 Simple random sample (Finite population)

9.3 Sampling from an Infinite Population

9.3.1 Random sample (Infinite population)

9.3.2 Selecting simple random sample using R

9.4 Sampling distribution

9.5 Sampling distribution of \(\bar X\)

9.5.1 Central limit theorem (CLT)

9.5.2 Central Limit Theorem through simulation

9.6 Sampling distribution of sample proportion, \(\hat p\)

9.7 Sampling Distribution of the Sample Variances

9.8 t-Distribution

9.9 F-Distribution

9.9.1 The F -Distribution with Two Sample Variances