---
title: "Normal Distribution"
author: "Scott Yanco"
date: "March 26, 2018"
output:
slidy_presentation:
highlight: haddock
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Learning Objectives
> - Be able to describe the charateristics of the normal distribution
> - Be able to use the mean and standard deviation to describe any normal distribution
> - Be able to convert a normally distributed variable to the "standard normal distribution" and calculate a Z score.
> - Get p values from a standard anormal table (or equivalent procedure in R).
> - Relate the Central Limit Theorem (CLT) to sampling distributions
> - Relate the normal and binomial distributions
## The Normal Distribution
```{r echo = F}
x <- seq(0, 100, by = .001)
fx <- dnorm(x, 50,2)
plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x", ylab = "probability", xlim = c(35, 65))
```
## The Normal Distribution
```{r echo = F}
plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x", ylab = "probability", xlim = c(35, 65))
text(55, 0.15, labels = expression(paste(mu,"= 50")), col = "red")
abline(v = 50, col = "red")
```
## The Normal Distribution
```{r echo = F}
plot(x, fx, type = "l", main = "Normal Distribution", xlab = "x", ylab = "probability", xlim = c(35, 65))
text(55, 0.15, labels = expression(paste(mu,"= 50")), col = "red")
abline(v = 50, col = "red")
abline(v = 48, col = "blue")
abline(v = 52, col = "blue")
text(45, 0.15, labels = expression(paste(sigma,"= 2")), col = "blue")
```
## Normal Distribution Characteristics
> - Continuous variables
> - Positive or negative numbers $-\infty$ to $\infty$
> - Symetrical
> - Described using $\mu$ and $\sigma$
> - The normal probability density function:
\[ \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{- \frac{(x- \mu )^{2}}{2 \sigma ^{2}}} \]
You don't need to memorize this, but note how this is functionally equivalent to the binomial function
## An example (with code!)
Let's work an example with owl masses (from my research).
```{r fig.width=4, fig.align = "center",echo=FALSE, warning=FALSE, message=FALSE, cache = T}
library(jpeg)
library(grid)
library(magick)
img <- image_read("IMG_0899.JPG")
img <- image_rotate(img, 90)
grid.raster(img)
```
## Owl Masses
```{r echo=TRUE, warning=FALSE, message=FALSE, cache = TRUE}
library(mosaic)
owls <- read.file("owl_morph.csv")
head(owls)
```
```{r echo=TRUE, fig.width = 4.5, cache = TRUE}
hist(owls$owl_mass)
```
## Owl Masses
```{r echo=TRUE}
mean.mass <- mean(owls$owl_mass)
print(mean.mass)
sd.mass <- sd(owls$owl_mass)
print(sd.mass)
```
## Owl Masses
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
hist(owls$owl_mass, freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)
```
##One small problem... {.bigger}
The following code will give us the probability mass of the normal distribution for values less than 0:
```{r fig.align="center", cache = TRUE, echo=TRUE}
pnorm(0, mean = mean.mass, sd = sd.mass)
```
Discuss with your neighbors: what's the problem with this?
##One small problem... {.bigger}
Non-negative probability of having owls with negative mass!
##The Standard Normal (it's like a medium regular)
Sometimes it is useful to convert a data set to the "standard normal" distribution. This distribution is a specific parameterization of the normal where: $\mu = 0$ and $\sigma = 1$
There are some very convenient properties of this distribution. What are they?
##Z-Score
We can convert data sets to the standard normal with a simple function:
$\vec{Z} = \frac{\vec{Y} - \mu}{\sigma}$
Let's do that with the owl data...
##Z Score Example
```{r echo = TRUE}
owls$z <- (owls$owl_mass - mean.mass)/sd.mass #note all the cool stuff we did with that code (new data frame column and used scalars on vectors...)
head(owls)
```
##Plot it!
```{r fig.align="center", fig.height = 4, fig.width=4, cache = TRUE, echo=FALSE}
hist(owls$owl_mass, main = "Histogram of Raw Owl Masses", breaks = seq(from = 35, to = 70, by = 3.5), freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)
```
```{r fig.align="center", fig.height = 4, fig.width=4, cache = TRUE, echo=FALSE}
hist(owls$z, main = "Histogram of Owl Mass Z Scores", breaks = seq(from = -2.5, to = 2, by= 0.5), freq = F)
zx.mass <- seq(-2.5, 2, by =.01)
zy.mass <- dnorm(zx.mass, mean = 0, sd = 1)
lines(zx.mass, zy.mass)
```
##Standard Normal Table
Using the table on page 279 of Whitlock and Schluter, what's the probability of a z-score more extreme than $|1.50|$?
What about if the z-score were derived from a wing length data set?
##But just use R though...
```{r cache = TRUE, echo = TRUE}
2*pnorm(1.5, lower.tail = F)
```
It even works without having to standardize the distribution...
```{r cache =TRUE, echo = TRUE}
pnorm(64, mean = mean.mass, sd = sd.mass, lower.tail = F) #this is the one-tailed version
```
Easier, right?
##Sampling From the Sample
Getting back to the owl data set, let's imagine we took a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is...)
In groups consider:
- What do we know about the sample distribution? (i.e. is it normal?)
- What sort of distribution should we expect from the sampling distribution (think back to recitation?
##Sampling From the Sample
Let's imagine we took a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is...)
```{r fig.align="center", cache = TRUE, echo=TRUE}
sampling.mass <- 0
for (i in 1:1000) {
samp <- sample(owls$owl_mass, length(owls$owl_mass), replace = T)
sampling.mass[i] <- mean(samp)
}
```
##Sampling From the Sample
```{r fig.align="center", fig.height = 5, cache = TRUE, echo=TRUE}
hist(sampling.mass, xlab = "mean owl mass", main = "Sampling Distribution of Mean Owl Masses")
```
##Is it normal?
```{r fig.align="center", fig.height = 4.5, cache = TRUE, echo=TRUE}
x.mass <- seq(20, 80, by = .01)
y.mass <- dnorm(x.mass, mean = mean(sampling.mass), sd = sd(sampling.mass))
hist(sampling.mass, xlab = "mean owl mass", main = "Sampling Distribution of Mean Owl Masses", freq = F)
lines(x.mass, y.mass)
```
##The Central Limit Theorem
```{r cache = TRUE}
library(beepr)
beep(3)
```
"...the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normal." (Whitlock and Schluter pg 286)
What does this mean?
##The Central Limit Theorem
Let's simulate a distinctly non-normal data set and see what happens when we invoke the CLT.
Let's consider the average height of human's in a daycare - what might we expect that histogram to look like?
##The Central Limit Theorem - The Daycare
Let's simulate a distinctly non-normal data set and see what happens when we invoke the CLT.
Let's consider the average height of human's in a daycare - what might we expect that histogram to look like?
```{r fig.align="center", cache = TRUE, echo=TRUE}
#code simulates the bimodal distribution
ad.ht <- rnorm(30, 67, 3)
kid.ht <- rnorm(100, 30, 5)
daycare.heights <- c(ad.ht, kid.ht)
```
##The Central Limit Theorem - The Daycare
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
#let's look at it
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
```
##The Central Limit Theorem - The Daycare
Let's take a look at the mean of that population:
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
pop.mean.ht <- mean(daycare.heights)
print(pop.mean.ht)
```
##The Central Limit Theorem - The Daycare
Let's take a look at the mean of that population:
```{r fig.align="center", fig.height =4, cache = TRUE, echo=TRUE}
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = pop.mean.ht, col = "red")
```
##Daycare Sample
OK, now we've looked at our population, which we typically cannot - let's do this with a sample, which is more realistic.
We'll take a random sample of 20, plot it and look at the mean:
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
```
##Daycare Sample
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```
##Daycare Sample Again
OK, now we've looked at our population, which we typically cannot - let's do this with a sample, which is more realistic.
We'll take a random sample of 20, plot it and look at the mean:
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```
##and again
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```
##and again
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```
##and again
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
```
##But Why Though?
Why are the results different each time? Discuss with your neighbors.
##SamplING Distribution
Let's do that sample trick a bunch of times and look at that distribution (sampling distribution).
```{r fig.align="center", cache = TRUE, echo=TRUE}
reps <- 1000
sampling.mean.dc <- 0
for (i in 1:reps) {
dc.samp <- sample(daycare.heights, 20, replace = F)
samp.mean <- mean(dc.samp)
sampling.mean.dc[i] <- samp.mean
}
```
##SamplING Distribution
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare", xlab = "Mean Heights (in.)")
```
##SamplING Distribution
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare", xlab = "Mean Heights (in.)")
abline(v= mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
```
##SamplING Distribution
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare", xlab = "Mean Heights (in.)")
abline(v= pop.mean.ht, col = "blue")
text(32, 150, "Blue = population mean", col = "blue")
abline(v= mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
```
##SamplING Distribution
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=FALSE}
hist(sampling.mean.dc, main = "Sampling distribution of heights at a daycare", xlab = "Mean Heights (in.)", freq = F, ylim = c(0, .12))
abline(v= pop.mean.ht, col = "blue")
text(32, 150, "Blue = population mean", col = "blue")
abline(v= mean(sampling.mean.dc), col = "red")
text(47, 150, "Red = mean of \n sampling distribution", col = "red")
x.ht <- seq(30, 50, by = .1)
y.ht <- dnorm(x.ht, mean = mean(sampling.mean.dc), sd = sd(sampling.mean.dc))
lines(x.ht, y.ht)
```
##Binomial to Normal
This concept is really just invoking the CLT again.
Let's use the owl sex data - we'll first simulate taking a bunch of random samples to get the $\mu$ and $\sigma$.
The we'll use the shortcut the book showed us.
##Binomial to Normal
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
p.sex <- props(owls$owl_sex)
print(p.sex)
```
##Binomial to Normal
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
successes <- 0
for (i in 1:reps) {
samp <- rbinom(1, size = 20, prob = p.sex[1])
successes[i] <- samp
}
```
##Binomial to Normal
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
mean(successes)
sd(successes)
```
or use the shortcut
$mean = np$ and $SD = \sqrt{np(1-p)}$
```{r fig.align="center", fig.height = 4, cache = TRUE, echo=TRUE}
n = 20
p = .25
#mean
n*p
#SD
sqrt(n*p * (1-p))
```
##In fact...
These shortcuts work precisely BECAUSE of the CLT! Cool right?
Remember that $SE = SD$ of the Sampling distribution? It's because of the CLT.
Remember $95 \% CI \approx 2 \times SE$ ? It's because fo the CLT.
A large proportion of inferential statistics rely upon the CLT to work... as you will see in the remainder of the semester.