Problem set 3: Dog Names and People Names

Due by 6:00 PM on Friday, October 18, 2019

In this problem set, we’ll be looking at the names of New York dogs and American humans.

If you haven’t done so already, download the skeleton project that contains the NYC dogs data.

You can clone it from GitHub if you’re familiar with that process:

Alternatively, download a zip file of the project:

When you download this file, make sure you unzip it rather than just looking inside it with your file manager.

Open the project in RStudio by double-clicking the nyc_tmp.Rproj file. Verify (in RStudio’s file viewer pane) that your project folder is structured like this:


Open the dogs.Rmd file and do your analysis there.

Load the required libraries
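
The startup messages shown below suggest that the tidyverse, socviz, and sf packages were attached. A minimal sketch of such a setup chunk, assuming all three are already installed, might look like the following. The messages also mention kjhutils and testthat; they are left out here on the assumption that the exercises do not need them.

```r
# Assumed setup chunk; the package list is inferred from the startup messages.
library(tidyverse)  # ggplot2, dplyr, tidyr, readr, and related packages
library(socviz)     # provides the %nin% operator used later on
library(sf)         # spatial data support; prints the GEOS/GDAL/PROJ line
```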

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::is_null() masks testthat::is_null()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::matches() masks tidyr::matches(), testthat::matches()
## Attaching package: 'socviz'
## The following object is masked from 'package:kjhutils':
##     %nin%
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

Load the dogs data

If you copy this code chunk into your working file, remove eval = FALSE from its chunk options; otherwise it won’t run.


Take a look at the babynames data
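
One way to do this, assuming the babynames package from CRAN is installed, is to load it and inspect the data frame of the same name. This is only a sketch of a first look, not necessarily the chunk used in the original handout.

```r
library(babynames)  # assumption: installed via install.packages("babynames")

# One row per year, sex, and name, with a count (n) and a proportion (prop)
head(babynames)
str(babynames)
```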


Remember how we’ve been working with the NYC dogs data.

Reproduce the code below in your own analysis.

boro_names <- c("Manhattan", "Queens", 
                "Brooklyn", "Bronx", 
                "Staten Island")
no_name <- c("Unknown", "Name Not Provided")

top_names <- nyc_license %>% 
  filter(borough %in% boro_names, 
         animal_name %nin% no_name) %>% 
  group_by(borough, animal_name) %>%
  tally() %>%
  mutate(freq = n / sum(n), 
         pct = freq * 100) %>% 
  top_n(5, wt = pct) %>%
  arrange(desc(pct))

p <- ggplot(data = top_names,
            mapping = aes(x = reorder(animal_name, pct), 
                          y = pct))

p + geom_point(size = 3) + 
  coord_flip() + 
  facet_wrap(~ reorder(borough, -pct), 
             scales = "free_y", ncol = 1) + 
  labs(x = NULL, 
       y = "Percent", 
       title = "Most Popular Dog Names")

Questions to answer

  1. Briefly explain in your own words what the sequence of operations from nyc_license %>% to arrange(desc(pct)) is doing to the data in the code chunk listed above.
  2. Produce a table and accompanying figure that shows the five most popular names for male and female dogs, instead of for each borough.
  3. Produce a table and accompanying figure that shows the five most popular names for male and female dogs, within each borough.
  4. Produce a map showing where dogs named Max are most commonly found in New York City.
  5. Now look at the babynames dataset. What were the five most popular names for girls in 1900? What were they in 2000?
  6. Make a figure tracing the popularity of any two of the following names: Shirley, Linda, Brittany, Emma, Ella, Mark, Oliver, Brayden, Michael. (Remember to filter the data by sex before trying to draw any plots.)
  7. Place the two names on the same plot.
  8. Can you think of a way to get a sense of the relative heterogeneity of boy vs girl names over time?


Knit the completed R Markdown file as a Word or PDF document (use the “Knit” button at the top of the script editor window). Save it with a name of the form lastname_firstname_ps03 and upload it to the Sakai dropbox.