Problem set 6: FARS data

Due by 11:59 PM on Sunday, December 15, 2019

In this final problem set, we’ll take a look at some data from the Fatality Analysis Reporting System (FARS), a data source on fatal vehicle accidents in the United States.

The data is available as an R package. To install it, do the following.

If you haven’t already, install drat:

  1. Install the drat package with install.packages("drat")

Then load it:

  1. library(drat)
  2. Add the repository where the data is: drat::addRepo("kjhealy")

You can now install farsdata with

  1. install.packages("farsdata")

Create a project for the assignment

Open the project in RStudio and make an Rmd file for the analysis, called something like farsdata.Rmd.

Load the required libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::is_null() masks testthat::is_null()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::matches() masks tidyr::matches(), testthat::matches()
library(socviz)
## 
## Attaching package: 'socviz'
## The following object is masked from 'package:kjhutils':
## 
##     %nin%
library(farsdata)

Take a look at the data

There are three datasets in the package. You can get a brief summary of each by looking at the Help file in RStudio for the farsdata package, or by looking at the documentation on the package homepage: http://kjhealy.github.io/farsdata.

persons
## # A tibble: 405 x 4
##    sex    race                                   year      n
##    <chr>  <chr>                                  <chr> <dbl>
##  1 Male   Hispanic                               2004   3857
##  2 Male   White, Non-Hispanic                    2004  17707
##  3 Male   Black, Non-Hispanic                    2004   3190
##  4 Male   American Indian, Non-Hispanic/Unknown  2004    415
##  5 Male   Asian, Non-Hispanic/Unknown            2004    282
##  6 Male   Pacific Islander, Non-Hispanic/Unknown 2004     37
##  7 Male   Multiple Races, Non-Hispanic/Unknown   2004     30
##  8 Male   All Other Non-Hispanic or Race         2004    412
##  9 Male   Unknown Race and Unknown Hispanic      2004   3513
## 10 Female Hispanic                               2004   1312
## # … with 395 more rows
agetimes
## # A tibble: 1,875 x 5
##    age   time          time_fct      year      n
##    <chr> <chr>         <ord>         <chr> <dbl>
##  1 0-15  0:00am-0:59am 0:00am-0:59am 2004     53
##  2 0-15  1:00am-1:59am 1:00am-1:59am 2004     55
##  3 0-15  2:00am-2:59am 2:00am-2:59am 2004     36
##  4 0-15  3:00am-3:59am 3:00am-3:59am 2004     30
##  5 0-15  4:00am-4:59am 4:00am-4:59am 2004     18
##  6 0-15  5:00am-5:59am 5:00am-5:59am 2004     24
##  7 0-15  6:00am-6:59am 6:00am-6:59am 2004     42
##  8 0-15  7:00am-7:59am 7:00am-7:59am 2004    110
##  9 0-15  8:00am-8:59am 8:00am-8:59am 2004     72
## 10 0-15  9:00am-9:59am 9:00am-9:59am 2004     65
## # … with 1,865 more rows
vehicles
## # A tibble: 945 x 5
##    vehicle_type           year involving    yes    no
##    <chr>                 <int> <chr>      <dbl> <dbl>
##  1 Passenger Car          2004 distracted  2864 22818
##  2 Light Truck - Pickup   2004 distracted  1365  9489
##  3 Light Truck - Utility  2004 distracted   931  6903
##  4 Light Truck - Van      2004 distracted   460  3227
##  5 Light Truck - Other    2004 distracted    13    98
##  6 Large Truck            2004 distracted   808  4094
##  7 Motorcycle             2004 distracted   420  3701
##  8 Bus                    2004 distracted    40   239
##  9 Other/Unknown          2004 distracted    92  1167
## 10 Passenger Car          2005 distracted  2604 22565
## # … with 935 more rows

How are the vehicles data organized? We have seven kinds of crash condition, recorded in the variable involving. These are distracted, drowsy, older, pedestrian, rollover, speeding, and younger. For each of those crash conditions, we have a count of the number of fatal crashes that involved (yes) or did not involve (no) the condition. These counts are broken out by vehicle_type (Passenger Car, Light Truck - Pickup, Motorcycle, etc.). Finally, we have these counts for each year from 2004 to 2018.
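
As a quick consistency check on that description, the row count in the printed tibble header (945) should equal vehicle types × crash conditions × years. Here is a minimal sketch in base R. It assumes that the nine vehicle types visible in the 2004 rows above are the full set, and that every combination appears exactly once; neither assumption is confirmed by the package documentation here.

```r
# Values copied from the printed output and the description above.
vehicle_types <- c("Passenger Car", "Light Truck - Pickup", "Light Truck - Utility",
                   "Light Truck - Van", "Light Truck - Other", "Large Truck",
                   "Motorcycle", "Bus", "Other/Unknown")
conditions <- c("distracted", "drowsy", "older", "pedestrian",
                "rollover", "speeding", "younger")
years <- 2004:2018

# One row per combination (assumption: none missing, none repeated).
grid <- expand.grid(vehicle_type = vehicle_types,
                    involving = conditions,
                    year = years,
                    stringsAsFactors = FALSE)

# 9 vehicle types x 7 conditions x 15 years
nrow(grid)
```

The result, 945, matches the number of rows reported for vehicles, which is consistent with the data holding one row per vehicle type, condition, and year.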

Understanding the Data

Take a look at this code:

vehicles %>%
   filter(vehicle_type == "Bus", year == 2004) %>% 
   mutate(total = yes + no)
## # A tibble: 7 x 6
##   vehicle_type  year involving    yes    no total
##   <chr>        <int> <chr>      <dbl> <dbl> <dbl>
## 1 Bus           2004 distracted    40   239   279
## 2 Bus           2004 drowsy         4   275   279
## 3 Bus           2004 older         57   222   279
## 4 Bus           2004 pedestrian    82   197   279
## 5 Bus           2004 rollover      17   262   279
## 6 Bus           2004 speeding      48   231   279
## 7 Bus           2004 younger       28   251   279

And then this code:

vehicles %>%
   filter(involving == "distracted", year == 2004) %>% 
   mutate(total = yes + no)
## # A tibble: 9 x 6
##   vehicle_type           year involving    yes    no total
##   <chr>                 <int> <chr>      <dbl> <dbl> <dbl>
## 1 Passenger Car          2004 distracted  2864 22818 25682
## 2 Light Truck - Pickup   2004 distracted  1365  9489 10854
## 3 Light Truck - Utility  2004 distracted   931  6903  7834
## 4 Light Truck - Van      2004 distracted   460  3227  3687
## 5 Light Truck - Other    2004 distracted    13    98   111
## 6 Large Truck            2004 distracted   808  4094  4902
## 7 Motorcycle             2004 distracted   420  3701  4121
## 8 Bus                    2004 distracted    40   239   279
## 9 Other/Unknown          2004 distracted    92  1167  1259

  1. What’s the difference between these two totals? Why do the totals in the first table all come to the same number, while the totals in the second table differ? What does this tell you about the data you are working with, and about the relationship between the vehicle_type and involving variables?

Exploring the Vehicle Data (vehicles)

  1. In vehicles, choose one crash condition. Graph the count of crashes for this condition over time for each vehicle type.
  2. Examine the differences between graphing counts of crashes and graphing proportions of crashes. When might it be useful to use one rather than the other?
  3. Are there crash conditions that are more common in specific vehicle types? Are there any interesting trends in crash conditions by vehicle types over time?
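
For the counts-versus-proportions question, one way to put vehicle types on a common scale is to divide yes by yes + no. The sketch below is only a starting point. It uses four of the 2004 distracted rows printed earlier, typed in by hand so that it runs without farsdata; the object names distracted_2004 and props and the column name prop are illustrative choices, not part of the package.

```r
library(tibble)
library(dplyr)

# Four rows copied from the printed 2004 "distracted" output above.
distracted_2004 <- tribble(
  ~vehicle_type,   ~yes,   ~no,
  "Passenger Car", 2864, 22818,
  "Large Truck",    808,  4094,
  "Motorcycle",     420,  3701,
  "Bus",             40,   239
)

props <- distracted_2004 %>%
  mutate(total = yes + no,      # all fatal crashes for this vehicle type
         prop  = yes / total) %>%  # share of them involving distraction
  arrange(desc(prop))

props
```

In these four rows, Passenger Car has by far the largest count, but sorting by prop puts Large Truck first. Whether that pattern holds in other years or for other conditions is something to check against the full vehicles data.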

Exploring the Persons Data (persons)

  1. Are there any race/ethnicity-specific trends in these data?
  2. Are men or women more likely to be killed in accidents? Is there any change over time in the difference between men’s and women’s fatality rates?

Exploring the Age/Time Data (agetimes)

Hint: There are two “hours” variables in agetimes, time and time_fct. The former is a character variable; the latter is an ordered factor, with “Unknown Hours” coded as NA, or missing.
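
To see what the hint means in practice, here is a small base R sketch of how an ordered factor behaves when a value is not among its levels. The vector below is made up for illustration: the label "10:00am-10:59am" is an assumed continuation of the pattern in the printed rows, and hour_levels is a hypothetical subset of the real levels.

```r
# A few hour labels in scrambled order, plus the unknown category.
time <- c("9:00am-9:59am", "10:00am-10:59am", "Unknown Hours", "0:00am-0:59am")

# Levels listed in clock order; "Unknown Hours" is deliberately left out.
hour_levels <- c("0:00am-0:59am", "9:00am-9:59am", "10:00am-10:59am")

time_fct <- factor(time, levels = hour_levels, ordered = TRUE)

# Values not found in the levels become NA.
is.na(time_fct)

# Sorting follows the level order, and NA values are dropped by default.
sort(time_fct)
```

Because the factor carries its own ordering, plots built on time_fct keep the hours in clock order, whereas a character variable is ordered alphabetically, and the result can depend on your locale.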

  1. Calculate the proportion or percentage of fatalities across hours, for each age group.
  2. Using either time or time_fct, plot the time variation across age groups in at least two different ways. What sort of visualization do you think is most effective, and why?
  3. Can the “Unknown Hours” category be safely ignored when plotting the data, or not? Say why or why not.

Finish

Knit the completed R Markdown file as an HTML, Word, or PDF document (use the “Knit” button at the top of the script editor window). Save it with a name of the form lastname_firstname_ps06 and upload it to the Sakai dropbox.