Problem set 3: Dog Names and People Names
Due by 6:00 PM on Friday, October 18, 2019
In this problem set, we’ll be looking at the names of New York dogs and American humans.
If you haven’t done so already, download the skeleton project that contains the NYC dogs data.
You can clone it from GitHub if you’re familiar with that process: https://github.com/kjhealy/nyc_tmp
Alternatively, download a zip file of the project: https://github.com/kjhealy/nyc_tmp/archive/master.zip
When you download this file, make sure you unzip it rather than just looking inside it with your file manager.
Open the project in RStudio by double-clicking the `nyc_tmp.Rproj` file. Verify (in RStudio’s file viewer pane) that your project folder is structured like this:
```
nyc_tmp/
  data/
    nyc_bites.rda
    nyc_license.rda
    nyc_zips.rda
  .gitignore
  .here
  dogs.Rmd
  nyc_tmp.Rproj
```
Open the `dogs.Rmd` file and do the analysis in that.
Load the required libraries
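The chunk that loads the libraries is not shown here. A minimal sketch, inferred from the startup messages: it assumes that `tidyverse`, `socviz` (which supplies `%nin%`), and `sf` (the source of the GEOS/GDAL/PROJ message) are the packages you need and that they are already installed. The `kjhutils` and `testthat` entries in the messages appear to come from the instructor's own setup and are assumed not to be required.

```r
# Assumed package list, inferred from the startup messages; adjust if your setup differs
library(tidyverse)  # ggplot2, dplyr, tidyr, and friends
library(socviz)     # provides the %nin% operator used later
library(sf)         # spatial data support for the map question
```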
```
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::is_null() masks testthat::is_null()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::matches() masks tidyr::matches(), testthat::matches()
##
## Attaching package: 'socviz'
## The following object is masked from 'package:kjhutils':
##
##     %nin%
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
```
Load the dogs data
If you copy this code chunk into your working file, remove `eval = FALSE` from its chunk options; otherwise it won’t run.
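The loading chunk itself is not shown here. A minimal sketch, assuming the `here` package is installed (the `.here` file in the project root suggests it is intended) and that each `.rda` file restores an object named after the file, such as `nyc_license`:

```r
library(here)

# Assumption: each file restores an object with the same name as the file,
# e.g. nyc_license.rda creates nyc_license
load(here("data", "nyc_license.rda"))
load(here("data", "nyc_bites.rda"))
load(here("data", "nyc_zips.rda"))
```

Run this from inside the `nyc_tmp` project so that `here()` resolves paths relative to the project folder.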
Take a look at the babynames data
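The chunk for this step is not shown here. A minimal sketch, assuming the `babynames` package is installed (for example via `install.packages("babynames")`):

```r
library(babynames)

# One row per year, sex, and name, with a count (n) and a proportion (prop)
head(babynames)
```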
Remember how we’ve been working with the NYC dogs data
Reproduce the code below in your own analysis.
```r
boro_names <- c("Manhattan", "Queens", "Brooklyn", "Bronx", "Staten Island")
no_name <- c("Unknown", "Name Not Provided")

top_names <- nyc_license %>%
  filter(borough %in% boro_names,
         animal_name %nin% no_name) %>%
  group_by(borough, animal_name) %>%
  tally() %>%
  mutate(freq = n / sum(n),
         pct = freq * 100) %>%
  top_n(5) %>%
  arrange(desc(pct))

p <- ggplot(data = top_names,
            mapping = aes(x = reorder(animal_name, pct), y = pct))

p + geom_point(size = 3) +
  coord_flip() +
  facet_wrap(~ reorder(borough, -pct), scales = "free_y", ncol = 1) +
  labs(x = NULL, y = "Percent", title = "Most Popular Dog Names") +
  theme_minimal()
```
Questions to answer
- Briefly explain in your own words what the sequence of operations ending with `arrange(desc(pct))` is doing to the data in the code chunk listed above.
- Produce a table and accompanying figure that shows the five most popular names for male and female dogs, instead of for each borough.
- Produce a table and accompanying figure that shows the five most popular names for male and female dogs, within each borough.
- Produce a map showing where dogs named Max are most commonly found in New York City.
- Now look at the `babynames` dataset. What were the five most popular names for girls in 1900? What were they in 2000?
- Make a figure tracing the popularity of any two of the following names: Shirley, Linda, Brittany, Emma, Ella, Mark, Oliver, Brayden, Michael. (Remember to filter the data by sex before trying to draw any plots.)
- Place the two names on the same plot.
- Can you think of a way to get a sense of the relative heterogeneity of boy vs girl names over time?
Knit the completed R Markdown file as a Word or PDF document (use the “Knit” button at the top of the script editor window). Save it with a name of the form `lastname_firstname_ps02` and upload it to the Sakai dropbox.