--- title: "Normal Distribution" author: "Scott Yanco" date: "March 26, 2018" output: slidy_presentation: highlight: haddock --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ## Learning Objectives > - Be able to describe the charateristics of the normal distribution > - Be able to use the mean and standard deviation to describe any normal distribution > - Be able to convert a normally distributed variable to the "standard normal distribution" and calculate a Z score. > - Get p values from a standard anormal table (or equivalent procedure in R). > - Relate the Central Limit Theorem (CLT) to sampling distributions > - Relate the normal and binomial distributions ## The Normal Distribution ```{r echo = F} x <- seq(0, 100, by = .001) fx <- dnorm(x, 50,2) plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x", ylab = "probability", xlim = c(35, 65)) ``` ## The Normal Distribution ```{r echo = F} plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x", ylab = "probability", xlim = c(35, 65)) text(55, 0.15, labels = expression(paste(mu,"= 50")), col = "red") abline(v = 50, col = "red") ``` ## The Normal Distribution ```{r echo = F} plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x", ylab = "probability", xlim = c(35, 65)) text(55, 0.15, labels = expression(paste(mu,"= 50")), col = "red") abline(v = 50, col = "red") abline(v = 48, col = "blue") abline(v = 52, col = "blue") text(45, 0.15, labels = expression(paste(sigma,"= 2")), col = "blue") ``` ## Normal Distribution Characteristics > - Continuous variables > - Positive or negative numbers $-\infty$ to $\infty$ > - Symetrical > - Described using $\mu$ and $\sigma$ > - The normal probability density function: \[ \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{- \frac{(x- \mu )^{2}}{2 \sigma ^{2}}} \]
You don't need to memorize this, but note that it plays the same role the binomial formula did: plug in a value of $x$ (along with the parameters, here $\mu$ and $\sigma$) and get back the probability density at that value.
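To see the formula in action, here's a minimal sketch (the helper `norm.pdf()` is just a name made up for this example) that codes the density directly and checks it against R's built-in `dnorm()`:

```{r echo=TRUE}
# Hand-coded normal probability density function, straight from the formula above
norm.pdf <- function(x, mu, sigma) {
  (1 / sqrt(2 * pi * sigma^2)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# These two calls should return the same value (~0.199)
norm.pdf(50, mu = 50, sigma = 2)
dnorm(50, mean = 50, sd = 2)
```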
## An example (with code!)

Let's work an example with owl masses (from my research).

```{r fig.width=4, fig.align = "center", echo=FALSE, warning=FALSE, message=FALSE, cache = TRUE}
# Read, rotate, and display a photo of one of the owls
library(grid)
library(magick)
img <- image_read("IMG_0899.JPG")
img <- image_rotate(img, 90)
grid.raster(img)
```

## Owl Masses
```{r echo=TRUE, warning=FALSE, message=FALSE, cache = TRUE}
# Load the mosaic package and read in the owl morphometric data
library(mosaic)
owls <- read.file("owl_morph.csv")
head(owls)
```
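(If you want to follow along without `owl_morph.csv`, here's a hypothetical stand-in with the same column names; the parameter values are rough guesses, not the real data. Skip this chunk if you have the file.)

```{r echo=TRUE, eval=FALSE}
# Hypothetical stand-in data -- simulated, NOT the real owl measurements
set.seed(1)
owls <- data.frame(
  owl_mass = rnorm(100, mean = 52, sd = 5),      # masses (units assumed)
  owl_sex  = sample(c("F", "M"), 100, replace = TRUE,
                    prob = c(0.25, 0.75))        # rough guess at sex ratio
)
```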
```{r echo=TRUE, fig.width = 4.5, cache = TRUE} hist(owls$owl_mass) ```
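For a quick numeric summary to pair with the histogram, the mosaic package we just loaded provides `favstats()`:

```{r echo=TRUE}
# Min, quartiles, median, max, mean, sd, n, and missing count in one call
favstats(owls$owl_mass)
```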
## Owl Masses

```{r echo=TRUE}
mean.mass <- mean(owls$owl_mass)
print(mean.mass)

sd.mass <- sd(owls$owl_mass)
print(sd.mass)
```

## Owl Masses

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
# Density-scale histogram with the fitted normal curve overlaid
hist(owls$owl_mass, freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)
```

## One small problem... {.bigger}

The following code gives us the probability, under the fitted normal distribution, of an owl mass less than 0:

```{r fig.align="center", cache = TRUE, echo=TRUE}
pnorm(0, mean = mean.mass, sd = sd.mass)
```

Discuss with your neighbors: what's the problem with this?

## One small problem... {.bigger}
A non-zero probability of having owls with negative mass!
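Just how big is that probability here? A quick sketch (illustrative; the exact value depends on your data, but with a mean this many SDs above zero it is vanishingly small):

```{r echo=TRUE}
# Probability the fitted normal puts below zero...
p.neg <- pnorm(0, mean = mean.mass, sd = sd.mass)
p.neg

# ...i.e., the expected number of "negative-mass owls" even among a billion owls
p.neg * 1e9
```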
## The Standard Normal (it's like a medium regular)

Sometimes it is useful to convert a data set to the "standard normal" distribution. This distribution is a specific parameterization of the normal where:

$\mu = 0$ and $\sigma = 1$

There are some very convenient properties of this distribution. What are they?

## Z-Score

We can convert data sets to the standard normal with a simple transformation:

$\vec{Z} = \frac{\vec{Y} - \mu}{\sigma}$

Let's do that with the owl data...

## Z Score Example

```{r echo = TRUE}
# Note all the cool stuff we did with this code: we created a new data frame
# column and applied scalar operations to a whole vector at once
owls$z <- (owls$owl_mass - mean.mass) / sd.mass
head(owls)
```

## Plot it!
```{r fig.align="center", fig.height = 4, fig.width=4, cache = TRUE, echo=FALSE} hist(owls$owl_mass, main = "Histogram of Raw Owl Masses", breaks = seq(from = 35, to = 70, by = 3.5), freq = F) x.mass <- seq(35, 70) y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass) lines(x.mass, y.mass) ```
```{r fig.align="center", fig.height = 4, fig.width=4, cache = TRUE, echo=FALSE} hist(owls$z, main = "Histogram of Owl Mass Z Scores", breaks = seq(from = -2.5, to = 2, by= 0.5), freq = F) zx.mass <- seq(-2.5, 2, by =.01) zy.mass <- dnorm(zx.mass, mean = 0, sd = 1) lines(zx.mass, zy.mass) ```
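As a sanity check, the z-scores should have a mean of 0 and a standard deviation of 1 by construction (up to floating-point rounding):

```{r echo=TRUE}
mean(owls$z)  # ~0 (floating-point noise aside)
sd(owls$z)    # exactly 1
```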
## Standard Normal Table

Using the table on page 279 of Whitlock and Schluter, what's the probability of a z-score more extreme than $\pm 1.50$?

What about if the z-score were derived from a wing length data set?

## But just use R though...

```{r cache = TRUE, echo = TRUE}
2*pnorm(1.5, lower.tail = F) #two-tailed: both tails beyond |z| = 1.5
```

It even works without having to standardize the distribution...

```{r cache = TRUE, echo = TRUE}
pnorm(64, mean = mean.mass, sd = sd.mass, lower.tail = F) #this is the one-tailed version
```

Easier, right?

## Sampling From the Sample

Getting back to the owl data set, let's imagine we take a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is...)

In groups consider:

- What do we know about the sample distribution? (i.e., is it normal?)
- What sort of distribution should we expect from the sampling distribution (think back to recitation)?

## Sampling From the Sample

Let's imagine we take a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is...)

```{r fig.align="center", cache = TRUE, echo=TRUE}
# Draw 1000 bootstrap samples and record the mean of each
sampling.mass <- numeric(1000)
for (i in 1:1000) {
  samp <- sample(owls$owl_mass, length(owls$owl_mass), replace = T)
  sampling.mass[i] <- mean(samp)
}
```

## Sampling From the Sample

```{r fig.align="center", fig.height = 5, cache = TRUE, echo=TRUE}
hist(sampling.mass, xlab = "mean owl mass",
     main = "Sampling Distribution of Mean Owl Masses")
```

## Is it normal?

```{r fig.align="center", fig.height = 4.5, cache = TRUE, echo=TRUE}
# Overlay a normal curve using the sampling distribution's own mean and SD
x.mass <- seq(20, 80, by = .01)
y.mass <- dnorm(x.mass, mean = mean(sampling.mass), sd = sd(sampling.mass))
hist(sampling.mass, xlab = "mean owl mass",
     main = "Sampling Distribution of Mean Owl Masses", freq = F)
lines(x.mass, y.mass)
```

## The Central Limit Theorem

```{r cache = TRUE}
# Fanfare!
library(beepr)
beep(3)
```

"...the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normal." (Whitlock and Schluter pg. 286)

What does this mean?

## The Central Limit Theorem

Let's simulate a distinctly non-normal data set and see what happens when we invoke the CLT.

Let's consider the heights of the humans in a daycare - what might we expect that histogram to look like?

## The Central Limit Theorem - The Daycare

Let's simulate a distinctly non-normal data set and see what happens when we invoke the CLT.

Let's consider the heights of the humans in a daycare - what might we expect that histogram to look like?

```{r fig.align="center", cache = TRUE, echo=TRUE}
# Simulate a bimodal population: a few adults plus many kids
ad.ht <- rnorm(30, 67, 3)
kid.ht <- rnorm(100, 30, 5)
daycare.heights <- c(ad.ht, kid.ht)
```

## The Central Limit Theorem - The Daycare

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
# Let's look at it
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
```

## The Central Limit Theorem - The Daycare

Let's take a look at the mean of that population:

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
pop.mean.ht <- mean(daycare.heights)
print(pop.mean.ht)
```

## The Central Limit Theorem - The Daycare

Let's take a look at the mean of that population:

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = pop.mean.ht, col = "red")
```

## Daycare Sample

OK, now we've looked at our population - something we typically can't do - so let's work with a sample instead, which is more realistic. We'll take a random sample of 20, plot it, and look at the mean:

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
```

## Daycare Sample

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## Daycare Sample Again

Now let's repeat that: another random sample of 20, plotted with its mean:

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## and again

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## and again

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## and again

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## But Why Though?

Why are the results different each time? Discuss with your neighbors.

## SamplING Distribution

Let's do that sample trick a bunch of times and look at the distribution of the resulting means (the sampling distribution).
```{r fig.align="center", cache = TRUE, echo=TRUE}
# Repeatedly sample 20 heights and record each sample's mean
reps <- 1000
sampling.mean.dc <- numeric(reps)
for (i in 1:reps) {
  dc.samp <- sample(daycare.heights, 20, replace = F)
  samp.mean <- mean(dc.samp)
  sampling.mean.dc[i] <- samp.mean
}
```

## SamplING Distribution

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)")
```

## SamplING Distribution

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)")
abline(v = mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
```

## SamplING Distribution

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)")
abline(v = pop.mean.ht, col = "blue")
text(32, 150, "Blue = population mean", col = "blue")
abline(v = mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
```

## SamplING Distribution

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
# With freq = F the y-axis is density, so the labels go at ~0.11,
# inside the new ylim, rather than at the old count-scale positions
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)", freq = F, ylim = c(0, .12))
abline(v = pop.mean.ht, col = "blue")
text(32, 0.11, "Blue = population mean", col = "blue")
abline(v = mean(sampling.mean.dc), col = "red")
text(47, 0.11, "Red = mean of \n sampling distribution", col = "red")
x.ht <- seq(30, 50, by = .1)
y.ht <- dnorm(x.ht, mean = mean(sampling.mean.dc), sd = sd(sampling.mean.dc))
lines(x.ht, y.ht)
```

## Binomial to Normal

This concept is really just invoking the CLT again. Let's use the owl sex data - we'll first simulate taking a bunch of random samples to get the $\mu$ and $\sigma$. Then we'll use the shortcut the book showed us.

## Binomial to Normal

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
# Proportion of each sex in the owl data (props() is from mosaic)
p.sex <- props(owls$owl_sex)
print(p.sex)
```

## Binomial to Normal

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
# Simulate many binomial samples of 20 owls each and record the "successes"
successes <- numeric(reps)
for (i in 1:reps) {
  samp <- rbinom(1, size = 20, prob = p.sex[1])
  successes[i] <- samp
}
```

## Binomial to Normal

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
mean(successes)
sd(successes)
```

or use the shortcut $\mu = np$ and $\sigma = \sqrt{np(1-p)}$

```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
n <- 20
p <- .25

#mean
n*p

#SD
sqrt(n*p * (1-p))
```

## In fact...

The normal approximation to the binomial works precisely BECAUSE of the CLT! Cool right?

Remember that $SE = SD$ of the sampling distribution? It's because of the CLT.

Remember $95\% \; CI \approx 2 \times SE$? It's because of the CLT.

A large proportion of inferential statistics relies upon the CLT to work... as you will see in the remainder of the semester.
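One way to see the $SE = SD$ claim concretely: compare the SD of our simulated daycare sampling distribution to the textbook formula $\sigma / \sqrt{n}$. (A rough check only - we sampled without replacement from a small population, so expect the simulated value to run slightly low.)

```{r echo=TRUE}
# SD of the simulated sampling distribution of the mean...
sd(sampling.mean.dc)

# ...versus the standard error formula, sigma / sqrt(n), with n = 20
sd(daycare.heights) / sqrt(20)
```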