---
title: "Normal Distribution"
author: "Scott Yanco"
date: "March 26, 2018"
output:
  slidy_presentation:
    highlight: haddock
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Learning Objectives

> - Be able to describe the characteristics of the normal distribution
> - Be able to use the mean and standard deviation to describe any normal distribution
> - Be able to convert a normally distributed variable to the "standard normal distribution" and calculate a Z score
> - Get p values from a standard normal table (or equivalent procedure in R)
> - Relate the Central Limit Theorem (CLT) to sampling distributions
> - Relate the normal and binomial distributions

## The Normal Distribution

```{r echo=FALSE}
x <- seq(0, 100, by = .001)
fx <- dnorm(x, 50, 2)
plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x",
     ylab = "probability density", xlim = c(35, 65))
```

## The Normal Distribution

```{r echo=FALSE}
plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x",
     ylab = "probability density", xlim = c(35, 65))
text(55, 0.15, labels = expression(paste(mu, " = 50")), col = "red")
abline(v = 50, col = "red")
```

## The Normal Distribution

```{r echo=FALSE}
plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x",
     ylab = "probability density", xlim = c(35, 65))
text(55, 0.15, labels = expression(paste(mu, " = 50")), col = "red")
abline(v = 50, col = "red")
abline(v = 48, col = "blue")
abline(v = 52, col = "blue")
text(45, 0.15, labels = expression(paste(sigma, " = 2")), col = "blue")
```

## Normal Distribution Characteristics

> - Continuous variables
> - Positive or negative numbers, from $-\infty$ to $\infty$
> - Symmetrical
> - Described using $\mu$ and $\sigma$
> - The normal probability density function:

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{- \frac{(x - \mu)^{2}}{2 \sigma^{2}}}$$

You don't need to memorize this, but note that it plays the same role the binomial formula does: it assigns a probability (here, a probability density) to each value of $x$.

## An example (with code!)

Let's work an example with owl masses (from my research).
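Before we get to the owl data, here's a quick sanity check that the density formula above matches R's built-in `dnorm()`. The values ($\mu = 50$, $\sigma = 2$, and the test point `x1 = 55`) are just illustrative choices, reusing the parameters from the earlier plots:

```{r echo=TRUE}
# Check that the pdf formula agrees with dnorm() for mu = 50, sigma = 2
mu <- 50
sigma <- 2
x1 <- 55  # an arbitrary test point
manual <- (1 / sqrt(2 * pi * sigma^2)) * exp(-(x1 - mu)^2 / (2 * sigma^2))
all.equal(manual, dnorm(x1, mean = mu, sd = sigma))
```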
```{r fig.width=4, fig.align="center", echo=FALSE, warning=FALSE, message=FALSE, cache=TRUE}
library(jpeg)
library(grid)
library(magick)
img <- image_read("IMG_0899.JPG")
img <- image_rotate(img, 90)
grid.raster(img)
```

## Owl Masses

```{r echo=TRUE, warning=FALSE, message=FALSE, cache=TRUE}
library(mosaic)
owls <- read.file("owl_morph.csv")
head(owls)
```

```{r echo=TRUE, fig.width=4.5, cache=TRUE}
hist(owls$owl_mass)
```
## Owl Masses

```{r echo=TRUE}
mean.mass <- mean(owls$owl_mass)
print(mean.mass)
sd.mass <- sd(owls$owl_mass)
print(sd.mass)
```

## Owl Masses

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
hist(owls$owl_mass, freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)
```

## One small problem... {.bigger}

The following code gives us the probability under the normal curve for values less than 0:

```{r fig.align="center", cache=TRUE, echo=TRUE}
pnorm(0, mean = mean.mass, sd = sd.mass)
```

Discuss with your neighbors: what's the problem with this?

## One small problem... {.bigger}

A nonzero probability of having owls with negative mass!

## The Standard Normal (it's like a medium regular)

Sometimes it is useful to convert a data set to the "standard normal" distribution. This distribution is a specific parameterization of the normal where:

$\mu = 0$ and $\sigma = 1$

There are some very convenient properties of this distribution. What are they?

## Z-Score

We can convert data sets to the standard normal with a simple function:

$$\vec{Z} = \frac{\vec{Y} - \mu}{\sigma}$$

Let's do that with the owl data...

## Z Score Example

```{r echo=TRUE}
owls$z <- (owls$owl_mass - mean.mass) / sd.mass
# note all the cool stuff we did with that code
# (a new data frame column, and scalar operations applied to a vector)
head(owls)
```

## Plot it!

```{r fig.align="center", fig.height=4, fig.width=4, cache=TRUE, echo=FALSE}
hist(owls$owl_mass, main = "Histogram of Raw Owl Masses",
     breaks = seq(from = 35, to = 70, by = 3.5), freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)
```
```{r fig.align="center", fig.height=4, fig.width=4, cache=TRUE, echo=FALSE}
hist(owls$z, main = "Histogram of Owl Mass Z Scores",
     breaks = seq(from = -2.5, to = 2, by = 0.5), freq = F)
zx.mass <- seq(-2.5, 2, by = .01)
zy.mass <- dnorm(zx.mass, mean = 0, sd = 1)
lines(zx.mass, zy.mass)
```

## Standard Normal Table

Using the table on page 279 of Whitlock and Schluter, what's the probability of a z-score more extreme than $|1.50|$?

What about if the z-score were derived from a wing length data set?

## But just use R though...

```{r cache=TRUE, echo=TRUE}
2 * pnorm(1.5, lower.tail = F)
```

It even works without having to standardize the distribution...

```{r cache=TRUE, echo=TRUE}
pnorm(64, mean = mean.mass, sd = sd.mass, lower.tail = F) # this is the one-tailed version
```

Easier, right?

## Sampling From the Sample

Getting back to the owl data set, let's imagine we took a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is...).

In groups, consider:

- What do we know about the sample distribution? (i.e., is it normal?)
- What sort of distribution should we expect for the sampling distribution (think back to recitation)?

## Sampling From the Sample

Let's imagine we took a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is...)

```{r fig.align="center", cache=TRUE, echo=TRUE}
sampling.mass <- 0
for (i in 1:1000) {
  samp <- sample(owls$owl_mass, length(owls$owl_mass), replace = T)
  sampling.mass[i] <- mean(samp)
}
```

## Sampling From the Sample

```{r fig.align="center", fig.height=5, cache=TRUE, echo=TRUE}
hist(sampling.mass, xlab = "mean owl mass",
     main = "Sampling Distribution of Mean Owl Masses")
```

## Is it normal?
```{r fig.align="center", fig.height=4.5, cache=TRUE, echo=TRUE}
x.mass <- seq(20, 80, by = .01)
y.mass <- dnorm(x.mass, mean = mean(sampling.mass), sd = sd(sampling.mass))
hist(sampling.mass, xlab = "mean owl mass",
     main = "Sampling Distribution of Mean Owl Masses", freq = F)
lines(x.mass, y.mass)
```

## The Central Limit Theorem

```{r cache=TRUE}
library(beepr)
beep(3)
```

"...the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normal." (Whitlock and Schluter, pg. 286)

What does this mean?

## The Central Limit Theorem

Let's simulate a distinctly non-normal data set and see what happens when we invoke the CLT. Let's consider the heights of the humans in a daycare - what might we expect that histogram to look like?

## The Central Limit Theorem - The Daycare

Let's simulate a distinctly non-normal data set and see what happens when we invoke the CLT. Let's consider the heights of the humans in a daycare - what might we expect that histogram to look like?

```{r fig.align="center", cache=TRUE, echo=TRUE}
# simulate the bimodal distribution: adults and kids
ad.ht <- rnorm(30, 67, 3)
kid.ht <- rnorm(100, 30, 5)
daycare.heights <- c(ad.ht, kid.ht)
```

## The Central Limit Theorem - The Daycare

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
# let's look at it
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
```

## The Central Limit Theorem - The Daycare

Let's take a look at the mean of that population:

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
pop.mean.ht <- mean(daycare.heights)
print(pop.mean.ht)
```

## The Central Limit Theorem - The Daycare

Let's take a look at the mean of that population:

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = pop.mean.ht, col = "red")
```

## Daycare Sample

OK, now we've looked at our population, which we typically cannot do - let's do this with a sample, which is more realistic. We'll take a random sample of 20, plot it, and look at the mean:

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
```

## Daycare Sample

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## Daycare Sample Again

Let's take another random sample of 20, plot it, and look at the mean:

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## and again

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## and again

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## and again

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare",
     xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```

## But Why Though?

Why are the results different each time? Discuss with your neighbors.

## SamplING Distribution

Let's do that sampling trick a bunch of times and look at the resulting distribution (the sampling distribution).

```{r fig.align="center", cache=TRUE, echo=TRUE}
reps <- 1000
sampling.mean.dc <- 0
for (i in 1:reps) {
  dc.samp <- sample(daycare.heights, 20, replace = F)
  samp.mean <- mean(dc.samp)
  sampling.mean.dc[i] <- samp.mean
}
```

## SamplING Distribution

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)")
```

## SamplING Distribution

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)")
abline(v = mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
```

## SamplING Distribution

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)")
abline(v = pop.mean.ht, col = "blue")
text(32, 150, "Blue = population mean", col = "blue")
abline(v = mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
```

## SamplING Distribution

```{r fig.align="center", fig.height=4, cache=TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare",
     xlab = "Mean Heights (in.)", freq = F, ylim = c(0, .12))
abline(v = pop.mean.ht, col = "blue")
text(32, 0.1, "Blue = population mean", col = "blue")  # y on the density scale
abline(v = mean(sampling.mean.dc), col = "red")
text(47, 0.1, "Red = mean of \n sampling distribution", col = "red")
x.ht <- seq(30, 50, by = .1)
y.ht <- dnorm(x.ht, mean = mean(sampling.mean.dc), sd = sd(sampling.mean.dc))
lines(x.ht, y.ht)
```

## Binomial to Normal

This concept is really just invoking the CLT again. Let's use the owl sex data - we'll first simulate taking a bunch of random samples to get the $\mu$ and $\sigma$. Then we'll use the shortcut the book showed us.

## Binomial to Normal

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
p.sex <- props(owls$owl_sex)
print(p.sex)
```

## Binomial to Normal

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
successes <- 0
for (i in 1:reps) {
  samp <- rbinom(1, size = 20, prob = p.sex)
  successes[i] <- samp
}
```

## Binomial to Normal

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
mean(successes)
sd(successes)
```

or use the shortcut $mean = np$ and $SD = \sqrt{np(1-p)}$

```{r fig.align="center", fig.height=4, cache=TRUE, echo=TRUE}
n = 20
p = .25
# mean
n * p
# SD
sqrt(n * p * (1 - p))
```

## In fact...

These shortcuts work precisely BECAUSE of the CLT! Cool, right?

Remember that $SE = SD$ of the sampling distribution? It's because of the CLT.

Remember $95\% \ CI \approx 2 \times SE$? It's because of the CLT.

A large proportion of inferential statistics rely upon the CLT to work... as you will see in the remainder of the semester.
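As a closing sketch of that $SE = SD$-of-the-sampling-distribution claim, here's a quick simulation. The population, sample size, and repetition count are all made-up illustrative values, not the owl data:

```{r echo=TRUE}
# Sketch: the analytic standard error sd/sqrt(n) approximates the SD
# of a simulated sampling distribution of the mean
set.seed(1)
pop <- rnorm(10000, mean = 50, sd = 10)  # a made-up population
n <- 25
means <- replicate(2000, mean(sample(pop, n)))
sd(means)          # SD of the sampling distribution
sd(pop) / sqrt(n)  # analytic SE -- should be close to the value above
```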