R-tude: Averages and Standard Deviations

R
Published

November 30, 2018

An etude is a short musical composition intended for intensive practice on a particular skill. My R-tudes are similar exercises to help me develop my R skills.

Code
library(tidyverse)

Let’s say you have a large population of people, each with an ability score on a scale from 0 to 1, and let’s start with the assumption that the scores are completely random, that is, uniformly distributed. Some people score higher, some lower, and the number of people at each level is roughly equal.

Code
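d <- runif(10000)  # 10,000 draws from a uniform distribution on [0, 1]
# binwidth = 0.1 with boundary = 0 gives ten buckets with edges at 0, 0.1, ..., 1
qplot(d, binwidth = 0.1, boundary = 0, geom = "histogram", fill = I("blue"), col = I("black"), alpha = I(0.5))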

Because the values are drawn uniformly at random, each histogram bucket holds roughly the same number of items.
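To check the "roughly the same number" claim numerically rather than by eye, one option is to count the draws per bucket. This is only a minimal sketch: the seed and the names `bucket` and `counts` are my own choices for illustration, not part of the original exercise.

```r
set.seed(42)       # arbitrary seed, only so the numbers are reproducible
d <- runif(10000)  # same kind of sample as above

# Cut [0, 1] into ten equal-width buckets and count the draws in each
bucket <- cut(d, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
counts <- table(bucket)
counts

# With 10,000 uniform draws, each bucket should hold about 1,000
range(counts)
```

The counts will not be exactly 1,000 each, but they should all sit close to it, which is what the flat histogram shows.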

This kind of distribution makes sense in, say, a world where people pick a number from 1 to 100 completely at random. But that’s not a typical real-world situation. A more typical example might involve, say, height:

Code
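# Heights drawn from one normal distribution; the mean is the midpoint of
# the two group means used later (78.9 and 68.5)
people_data <- tibble(height = rnorm(100000, mean = (78.9 + 68.5) / 2, sd = (2.24 + 2.16) / 2),
                      person = 1:100000)
# breaks fixes the bucket edges at 50, 51, ..., 100, so bins is not needed
ggplot(data = people_data, aes(height)) +
  geom_histogram(fill = "blue", col = "black", alpha = 0.5, breaks = seq(50, 100)) +
  labs(title = "Histogram for height")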

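The spread of the curve depends on the standard deviation. The code above uses about 2.2, a figure the CDC reports as the standard error of the mean for Americans. Strictly speaking, a standard error of the mean measures the uncertainty in the estimated mean and is smaller than the population’s standard deviation, so treat 2.2 here as an illustrative value for the spread.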
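The text mentions both a standard deviation and a standard error of the mean, and the two are easy to mix up. The sketch below simulates heights with the same mean and spread as above and computes both quantities. The seed and the variable names are my own assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only so the numbers are reproducible
n <- 100000
height <- rnorm(n, mean = (78.9 + 68.5) / 2, sd = (2.24 + 2.16) / 2)

sample_sd  <- sd(height)           # spread of individual heights, close to 2.2
sample_sem <- sample_sd / sqrt(n)  # uncertainty of the estimated mean

c(sd = sample_sd, sem = sample_sem)
```

The standard deviation stays near 2.2 however many people are sampled, while the standard error of the mean shrinks in proportion to 1 / sqrt(n).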

Subpopulations

The above example puts everybody in the same pile. Now let’s consider what happens when the people are divided into groups. For example, let’s assume roughly half the people are in the “M” group and half in the “F” group.

Code
pop_size <- 10000

# Assign each person to "M" or "F" with probability 0.5 each
people_data <- tibble(person = 1:pop_size,
                      group = factor(ifelse(rbinom(pop_size, 1, .5) == 1, "M", "F")),
                      height = rnorm(pop_size, mean = (78.9 + 68.5) / 2, sd = (2.24 + 2.16) / 2)
                      )

# Number of people in each group
m <- sum(people_data$group == "M")
f <- sum(people_data$group == "F")

# Overwrite the heights with group-specific draws; note sd = 10 here,
# much wider than the 2.2 used for the pooled draw
people_data[people_data$group == "M", "height"] <- rnorm(m, mean = 78.9, sd = 10)
people_data[people_data$group == "F", "height"] <- rnorm(f, mean = 68.5, sd = 10)


# breaks fixes the bucket edges at 50, 51, ..., 100; heights outside that range are not drawn
ggplot(data = people_data, aes(height, fill = group)) +
  geom_histogram(col = "black", alpha = 0.5, breaks = seq(50, 100)) +
  labs(title = "Histogram for height")

ggplot(data = people_data, aes(height,fill = group)) +
  geom_density(alpha = 0.3)
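The group structure in the plots above can also be checked numerically by comparing each group's average and standard deviation with those of the whole population. This is a self-contained base R sketch that regenerates the data with the same group means and sd = 10. The seed and the names `group_means`, `group_sds`, and `overall_sd` are assumptions of mine for illustration.

```r
set.seed(42)  # arbitrary seed, only so the numbers are reproducible
pop_size <- 10000
group <- ifelse(rbinom(pop_size, 1, 0.5) == 1, "M", "F")

# Same group means and sd = 10 as in the code above
height <- rnorm(pop_size, mean = ifelse(group == "M", 78.9, 68.5), sd = 10)

group_means <- tapply(height, group, mean)  # close to 68.5 (F) and 78.9 (M)
group_sds   <- tapply(height, group, sd)    # each close to 10
overall_sd  <- sd(height)                   # larger: it also includes the gap between groups

group_means
group_sds
overall_sd
```

Under these assumptions the overall standard deviation should come out at roughly 11.3, because the variance of a 50/50 mixture is the within-group variance (100) plus the between-group variance (5.2^2, about 27).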

You can watch this on an interactive graph at: