GeekWire’s Monica Nickelsburg wrote “Where do Seattle’s newcomers move from? Drivers license numbers reveal some surprises”, with a pretty Excel chart showing the top states from which people move into King County.
But her chart doesn’t correct for the population of each state. Can we do better in R?
First, I downloaded the raw data from the Washington State Department of Licensing, which appears to be the source for her article.
Then I converted all the data to Tidy format:
library(tidyverse)

dol_king <- readxl::read_excel(dol_path, sheet = "King", skip = 5)
data(state) # read state abbreviations

# load state populations
census_pop <- read.csv(census_pop_path)

# convert to Tidy dataset
# fix an error in the DOL spelling of Mississippi
# (assumption: the misspelled entry still begins with "Missis")
dol_king$From[grepl("^Missis", dol_king$From)] <- "Mississippi"
dol_king <- dol_king %>%
  setNames(stringr::str_replace(names(dol_king), "CY ", ""))
dol_king <- dol_king %>% gather(Year, Change, -From)
# dol_king is a tidy dataframe (tibble) showing the number of people
# who moved to King County from each state between 2006-2017

# Now do the same with census_pop
census_pop_no <- census_pop %>%
  select(starts_with("POPESTIMATE")) %>%
  tibble::as_tibble() # just the numbers for populations, not state names
census_pop_no <- census_pop_no %>%
  setNames(stringr::str_replace_all(names(census_pop_no), "[:alpha:]*", ""))
census_pop <- cbind(select(census_pop, "NAME"), census_pop_no) %>%
  tibble::as_tibble()
census_pop <- census_pop %>% gather(Year, Population, -NAME)
census_pop$NAME <- as.character(census_pop$NAME)

# join populations to the DOL counts by state and year
dol <- dplyr::left_join(census_pop, dol_king, by = c("NAME" = "From", "Year" = "Year"))
dol$NAME <- factor(dol$NAME)
names(dol)[1] <- "From" # rename only the first column
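The wide-to-long reshaping above relies on gather(), which tidyr has since superseded with pivot_longer(). As a rough sketch of the same step in the newer style, here is a toy table whose column names mimic the "CY <year>" pattern of the DOL sheet; the table name `wide` and the counts in it are placeholders for illustration, not DOL figures.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per state, one column per calendar year.
# The counts are made-up placeholders, not real DOL data.
wide <- tibble(
  From = c("Alaska", "Oregon"),
  `CY 2016` = c(10, 20),
  `CY 2017` = c(30, 40)
)

# Reshape to one row per (state, year) and strip the "CY " prefix
# from the year labels in the same call.
long <- pivot_longer(
  wide,
  cols = -From,
  names_to = "Year",
  names_prefix = "CY ",
  values_to = "Change"
)

print(long)
```

Doing the prefix removal inside pivot_longer() would replace the separate setNames() call, though either approach should give the same tidy result.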
This gives me one handy variable, dol, with each state's population alongside the number of people who moved from there to King County in each year.
## # A tibble: 416 x 4
##    From                 Year  Population Change
##    <fct>                <chr>      <int>  <dbl>
##  1 Alabama              2010     4785579    180
##  2 Alaska               2010      714015    679
##  3 Arizona              2010     6407002   2045
##  4 Arkansas             2010     2921737    210
##  5 California           2010    37327690  10373
##  6 Colorado             2010     5048029   1423
##  7 Connecticut          2010     3580171    344
##  8 Delaware             2010      899712     80
##  9 District of Columbia 2010      605040     NA
## 10 Florida              2010    18846461   2780
## # … with 406 more rows
Now it’s just a matter of applying simple calculations to normalize the data.
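As a minimal sketch of that normalization, the snippet below divides the number of movers by the state population and scales it to movers per 100,000 residents. It uses three 2010 rows copied from the printed dol output; the names `dol_sample` and `PerHundredK` are mine, chosen only for this illustration.

```r
library(dplyr)
library(tibble)

# Three 2010 rows copied from the printed `dol` output
dol_sample <- tibble(
  From = c("Alabama", "Alaska", "California"),
  Year = "2010",
  Population = c(4785579, 714015, 37327690),
  Change = c(180, 679, 10373)
)

# Movers per 100,000 residents of the origin state
dol_rate <- dol_sample %>%
  mutate(PerHundredK = Change / Population * 1e5)

print(dol_rate)
```

Scaling to "per 100,000" is just a readability choice; the heatmap works equally well on the raw fraction Change/Population.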
Let’s draw this as a heatmap, with darker colors representing smaller percentages of a state’s population, and lighter colors representing larger percentages.
ggplot(data = dol, mapping = aes(x = Year, y = From, fill = Change / Population)) +
  geom_tile() +
  scale_y_discrete(limits = rev(levels(dol$From)))
The lighter the color, the higher the percentage of people from that state (and year) who are moving to King County. Represented this way, a few states stand out: Alaska and Oregon, for example. Although their overall populations are relatively small, lots of people move here from there. By comparison, relatively few residents of large states like California or Texas move here.
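To make the contrast concrete, here is a small sketch that ranks a handful of states twice, once by raw count and once by population-adjusted rate, using 2010 rows copied from the printed dol output. The column names `Rate`, `RawRank` and `RateRank` are hypothetical, and five states in a single year are only an illustration, not the full picture.

```r
library(dplyr)
library(tibble)

# 2010 rows copied from the printed `dol` output
dol_2010 <- tibble(
  From = c("Alaska", "Arizona", "California", "Colorado", "Florida"),
  Population = c(714015, 6407002, 37327690, 5048029, 18846461),
  Change = c(679, 2045, 10373, 1423, 2780)
)

ranked <- dol_2010 %>%
  mutate(
    Rate = Change / Population,          # share of the state's residents who moved
    RawRank = min_rank(desc(Change)),    # 1 = most movers in absolute terms
    RateRank = min_rank(desc(Rate))      # 1 = most movers relative to population
  ) %>%
  arrange(RateRank)

print(ranked)
```

Among these five, California sends the most people in absolute terms, but Alaska comes out on top once population is taken into account.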
Interestingly, a non-obvious standout is Hawaii. I don’t normally think of Hawaiians as likely to move to Seattle, but percentage-wise they’re pretty high. In fact, for the last few years the average Hawaiian has been more likely to move here than the average Idahoan. Go figure.
You can also see a few trends over time. For example, although both Montana and Idaho have sent a fair share of people here since the early 2010s, their enthusiasm seems to have waned in the past few years. Similarly, Nevadans seem to have slowed down too.
It’s a big country, so I wouldn’t read too much into this information – it’s not as though there’s a stampede in one direction or the other. Just normal people doing normal things.