Notes for October 23rd/26th: Mapping Functions

Materials for class on October 23 and 26, 2019

Contents

Mapping functions to subsets of your data

In class we talked about the general strategy of table-based data analysis. You start with your data in some sort of tabular layout—perhaps spread across several tables, or perhaps just one very large table—and as you explore it you group or split the data into sub-units of some sort (in effect: smaller tables) and then analyze those sub-units in some way. The analysis can range from the very straightforward (e.g., “How many of these things are there?”, which is what tally() is for) to the more complex (e.g., “I want to fit some sort of statistical model here”). In the process of doing the analysis, whatever it is, we will be performing various sorts of operations on our tables. Some of them involve adding columns to the table. This is what mutate() is for. And some of them involve producing some sort of summary of the table, which is what summarize() is for.

If you think about it, mutate() and summarize() are not really things you do to the data. Rather, inside them we specify the function that does the thing we want: using mutate() to create a new column that adds the values of two existing columns together, for example, or using summarize() to say that we want the mean() or median() of some grouped data.

Quick Review

First let’s set up some libraries and load some data.

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::is_null() masks testthat::is_null()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::matches() masks tidyr::matches(), testthat::matches()
library(broom)
library(socviz)
## 
## Attaching package: 'socviz'
## The following object is masked from 'package:kjhutils':
## 
##     %nin%
library(congress)
library(gapminder)

The Gapminder data is very familiar at this point:

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

We know that we can perform mutating actions on the whole dataset at once—that is, actions that add one or more columns to the table but don’t change the number of rows:

gapminder %>% 
   mutate(pop_m = pop/1e6) # population in millions
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap pop_m
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  8.43
##  2 Afghanistan Asia       1957    30.3  9240934      821.  9.24
##  3 Afghanistan Asia       1962    32.0 10267083      853. 10.3 
##  4 Afghanistan Asia       1967    34.0 11537966      836. 11.5 
##  5 Afghanistan Asia       1972    36.1 13079460      740. 13.1 
##  6 Afghanistan Asia       1977    38.4 14880372      786. 14.9 
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12.9 
##  8 Afghanistan Asia       1987    40.8 13867957      852. 13.9 
##  9 Afghanistan Asia       1992    41.7 16317921      649. 16.3 
## 10 Afghanistan Asia       1997    41.8 22227415      635. 22.2 
## # … with 1,694 more rows

And also summarizing actions on the whole dataset at once—that is, actions that calculate some statistic (however simple or complex) that results in a summary table that’s smaller than the original:

gapminder %>% 
   summarize(pop_mean = mean(pop)) 
## # A tibble: 1 x 1
##    pop_mean
##       <dbl>
## 1 29601212.

We have also been doing a lot of grouping-and-summarizing:

gapminder %>% 
   group_by(continent) %>%
   summarize(le_mean = mean(lifeExp)) 
## # A tibble: 5 x 2
##   continent le_mean
##   <fct>       <dbl>
## 1 Africa       48.9
## 2 Americas     64.7
## 3 Asia         60.1
## 4 Europe       71.9
## 5 Oceania      74.3
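
The same grouped logic covers simple counting, too. Since tally() was mentioned above, here's a quick sketch of what it looks like in grouped form. Each continent's count is its number of countries times the twelve survey years:

gapminder %>% 
   group_by(continent) %>%
   tally()  # counts rows per continent, e.g. n = 624 for Africa (52 countries x 12 years)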

Nesting your data

Sometimes, instead of just grouping our data, we want a kind of “super-grouping”. The motivation is that we want to perform some series of actions on subgroups of our data, but those actions don’t necessarily produce output that can immediately be slotted into a single table. This happens, for example, when we do some statistical analysis and get the results by subgroup, but those results contain all kinds of bits and pieces that aren’t single variables we can put in a column. In these cases we take advantage of our ability to nest data and store results in list columns. The simplest sort of list column is just a miniature table of data for the row we’re working with. Here’s gapminder again:

out_le <- gapminder %>%
    group_by(continent, year) %>%
    nest()

out_le
## # A tibble: 60 x 3
## # Groups:   continent, year [60]
##    continent  year           data
##    <fct>     <int> <list<df[,4]>>
##  1 Asia       1952       [33 × 4]
##  2 Asia       1957       [33 × 4]
##  3 Asia       1962       [33 × 4]
##  4 Asia       1967       [33 × 4]
##  5 Asia       1972       [33 × 4]
##  6 Asia       1977       [33 × 4]
##  7 Asia       1982       [33 × 4]
##  8 Asia       1987       [33 × 4]
##  9 Asia       1992       [33 × 4]
## 10 Asia       1997       [33 × 4]
## # … with 50 more rows

We’ve reorganized our data into continent-year rows, each of which contains a column identifying the continent, a column identifying the year, and then a special list column with the other four variables stored as a little mini-tibble. By default it is called data. Think of what nest() does as a more intensive version of what group_by() does. We can look at what’s inside any given row’s data list column by filtering and unnesting it:

out_le %>% 
   filter(continent == "Europe" & year == 1977) %>% 
   unnest(cols = c(data))
## # A tibble: 30 x 6
## # Groups:   continent, year [5]
##    continent  year country                lifeExp      pop gdpPercap
##    <fct>     <int> <fct>                    <dbl>    <int>     <dbl>
##  1 Europe     1977 Albania                   68.9  2509048     3533.
##  2 Europe     1977 Austria                   72.2  7568430    19749.
##  3 Europe     1977 Belgium                   72.8  9821800    19118.
##  4 Europe     1977 Bosnia and Herzegovina    69.9  4086000     3528.
##  5 Europe     1977 Bulgaria                  70.8  8797022     7612.
##  6 Europe     1977 Croatia                   70.6  4318673    11305.
##  7 Europe     1977 Czech Republic            70.7 10161915    14800.
##  8 Europe     1977 Denmark                   74.7  5088419    20423.
##  9 Europe     1977 Finland                   72.5  4738902    15605.
## 10 Europe     1977 France                    73.8 53165019    18293.
## # … with 20 more rows
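
Alternatively, because the data column is at bottom just a list, we can pull out any single element with ordinary list indexing. A quick sketch:

out_le$data[[1]]  # the mini-tibble for the first row, i.e. Asia in 1952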

List columns are useful because we can act on them in a compact and tidy way. In particular, we can pass functions along to each row of the list column and make something happen. For simple operations, just using group_by() often works fine. But if we want to do something a little more complicated, nesting the columns we want and then mapping a function to each piece will be more effective.

For example, we might want to try to predict life expectancy using GDP per capita. Let’s say we wanted to see how a very simple model of this sort performed when applied to each continent-year subset. We can write a little custom, one-off function that says “fit the model we want to some data”. Like this:

fit_ols <- function(df) {
    lm(lifeExp ~ log(gdpPercap), data = df)
}

Now we have an object called fit_ols(). It’s a function, and what it does is try to find the best-fitting linear relationship between lifeExp and the log of gdpPercap for some set of data, which we’re referring to by the placeholder df. The function requires us to specify df whenever we call it, because otherwise it won’t know what to do.

We can fit it on the whole Gapminder dataset if we like:

out <- fit_ols(df = gapminder)

summary(out)
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

But what we really want to do is to apply or map this function to every one of our continent-year subsets. That is, to the data piece of every row of the out_le object:

out_le
## # A tibble: 60 x 3
## # Groups:   continent, year [60]
##    continent  year           data
##    <fct>     <int> <list<df[,4]>>
##  1 Asia       1952       [33 × 4]
##  2 Asia       1957       [33 × 4]
##  3 Asia       1962       [33 × 4]
##  4 Asia       1967       [33 × 4]
##  5 Asia       1972       [33 × 4]
##  6 Asia       1977       [33 × 4]
##  7 Asia       1982       [33 × 4]
##  8 Asia       1987       [33 × 4]
##  9 Asia       1992       [33 × 4]
## 10 Asia       1997       [33 × 4]
## # … with 50 more rows

And we can do that with the map() function. Let’s start over:

out_le <- gapminder %>%
  group_by(continent, year) %>%
  nest() %>%
  mutate(model = map(data, fit_ols))

out_le
## # A tibble: 60 x 4
## # Groups:   continent, year [60]
##    continent  year           data model 
##    <fct>     <int> <list<df[,4]>> <list>
##  1 Asia       1952       [33 × 4] <lm>  
##  2 Asia       1957       [33 × 4] <lm>  
##  3 Asia       1962       [33 × 4] <lm>  
##  4 Asia       1967       [33 × 4] <lm>  
##  5 Asia       1972       [33 × 4] <lm>  
##  6 Asia       1977       [33 × 4] <lm>  
##  7 Asia       1982       [33 × 4] <lm>  
##  8 Asia       1987       [33 × 4] <lm>  
##  9 Asia       1992       [33 × 4] <lm>  
## 10 Asia       1997       [33 × 4] <lm>  
## # … with 50 more rows

Notice that the job of mutate() is the same as always. We’re using it to add a column to our main table. In this case, we call it model. But instead of being the direct result of a bit of arithmetic or some other simple operation, the new column is created by using map() to apply the fit_ols() function to every row of the data list column. Every one of those rows is a little mini-tibble, remember. And the fit_ols() function expects one argument, a table of data with columns named lifeExp and gdpPercap. The map() function feeds data to fit_ols() for each row of the grouped Gapminder data. The function does its job and returns a linear model object, which gets stored in the model column for that row of the nested Gapminder dataset. So now we have a list column named data and a list column named model. The value of the list column format should be clearer here, because unlike a simpler function or operation, the thing that fit_ols() returns is not a simple table that can be added directly to our main data table.
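
Each element of the model column is an ordinary lm object, so we can extract one and inspect it just as if we had fit it by hand. For example:

summary(out_le$model[[1]])  # the model fit to the first row, Asia in 1952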

When the results of our functions are sitting inside list columns like this, they’re not much use to us. But the idea isn’t just to store them there. Instead, we can do further work on them and have them produce results that fit into a nice tidy table. For example, we can use broom’s tidy() function to get some summaries from the models we just fit. Again, we use map():

out_tidy <- gapminder %>%
    group_by(continent, year) %>%
    nest() %>% 
    mutate(model = map(data, fit_ols),
           tidied = map(model, tidy)) %>%
    unnest(cols = c(tidied)) 

out_tidy
## # A tibble: 120 x 9
## # Groups:   continent, year [60]
##    continent  year     data model term  estimate std.error statistic
##    <fct>     <int> <list<d> <lis> <chr>    <dbl>     <dbl>     <dbl>
##  1 Asia       1952 [33 × 4] <lm>  (Int…    15.8       9.27      1.71
##  2 Asia       1952 [33 × 4] <lm>  log(…     4.16      1.25      3.33
##  3 Asia       1957 [33 × 4] <lm>  (Int…    18.1       9.70      1.86
##  4 Asia       1957 [33 × 4] <lm>  log(…     4.17      1.28      3.26
##  5 Asia       1962 [33 × 4] <lm>  (Int…    16.6       9.52      1.74
##  6 Asia       1962 [33 × 4] <lm>  log(…     4.59      1.24      3.72
##  7 Asia       1967 [33 × 4] <lm>  (Int…    19.8       9.05      2.19
##  8 Asia       1967 [33 × 4] <lm>  log(…     4.50      1.15      3.90
##  9 Asia       1972 [33 × 4] <lm>  (Int…    21.9       8.14      2.69
## 10 Asia       1972 [33 × 4] <lm>  log(…     4.44      1.01      4.41
## # … with 110 more rows, and 1 more variable: p.value <dbl>

Look at that. Now our table has new columns containing information from the model (estimate, std.error, etc.), and also new rows. Double the number of rows, in fact, because within each continent-year observation we have these estimates and standard errors for each term in the model. In this case, two terms: an estimate of the intercept, and an estimate of the coefficient (i.e. the slope) of the log(gdpPercap) variable. Let’s forget about the intercept:

out_tidy <- gapminder %>%
    group_by(continent, year) %>%
    nest() %>% 
    mutate(model = map(data, fit_ols),
           tidied = map(model, tidy)) %>%
    unnest(cols = c(tidied)) %>%
    filter(term %nin% "(Intercept)")

out_tidy
## # A tibble: 60 x 9
## # Groups:   continent, year [60]
##    continent  year     data model term  estimate std.error statistic
##    <fct>     <int> <list<d> <lis> <chr>    <dbl>     <dbl>     <dbl>
##  1 Asia       1952 [33 × 4] <lm>  log(…     4.16     1.25       3.33
##  2 Asia       1957 [33 × 4] <lm>  log(…     4.17     1.28       3.26
##  3 Asia       1962 [33 × 4] <lm>  log(…     4.59     1.24       3.72
##  4 Asia       1967 [33 × 4] <lm>  log(…     4.50     1.15       3.90
##  5 Asia       1972 [33 × 4] <lm>  log(…     4.44     1.01       4.41
##  6 Asia       1977 [33 × 4] <lm>  log(…     4.87     1.03       4.75
##  7 Asia       1982 [33 × 4] <lm>  log(…     4.78     0.852      5.61
##  8 Asia       1987 [33 × 4] <lm>  log(…     5.17     0.727      7.12
##  9 Asia       1992 [33 × 4] <lm>  log(…     5.09     0.649      7.84
## 10 Asia       1997 [33 × 4] <lm>  log(…     5.11     0.628      8.15
## # … with 50 more rows, and 1 more variable: p.value <dbl>

We now have tidy regression output with an estimate of the association between log GDP per capita and life expectancy for each year, within continents. We can plot these estimates in a way that takes advantage of their groupiness, perform further calculations on them, and so on.
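
For instance, here is a sketch of one such plot: the yearly slope estimates by continent, with rough ±2 standard error ranges (a conventional approximation, not an exact interval):

out_tidy %>%
    ggplot(aes(x = year, y = estimate,
               ymin = estimate - 2*std.error,
               ymax = estimate + 2*std.error)) +
    geom_pointrange() +                  # point estimate plus error bar
    facet_wrap(~ continent, nrow = 1) +  # one panel per continent
    labs(x = "Year", y = "Estimated slope of log GDP per capita")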

Principal Components Example

A few of you are doing some PCA with Professor Mukherjee. Principal Component Analysis is a technique that’s quite closely related to linear regression. It’s often used in situations where we have too many variables or “features” in our data to easily understand, and we suspect they might tend to group together into a much smaller number of underlying factors or tendencies or components. In a PCA approach, we apply a series of transformations to the data in order to find the “best” set of underlying components. In particular we want the dimensions we choose to be orthogonal to one another—that is, uncorrelated. PCA is an inductive approach to data analysis. Just because of the way it works, we’re arithmetically guaranteed to find a set of components that “explain” all the variance we observe. The substantive question is whether the components uncovered by the PCA have a plausible interpretation.

PCA on the Midwest data

The Gapminder data isn’t really well suited to this approach, just because we have relatively few variables. (There are only three numeric variables.) Instead, let’s use the midwest data that we’ve seen once or twice before. Remember what it looks like:

midwest
## # A tibble: 437 x 28
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA… IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO… IL    0.017     5322       313.     5298        1
##  8   568 CARRO… IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP… IL    0.058   173025      2983.   146506    16559
## # … with 427 more rows, and 20 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

The rows are US counties from five Midwestern states, and we have a whole bunch of numeric measures from the Census. Let’s group the counties by state and then select just the numeric measures:

mw_pca <- midwest %>%
    group_by(state) %>%
    select_if(is.numeric) %>%
    select(-PID)
    
mw_pca
## # A tibble: 437 x 25
## # Groups:   state [5]
##    state  area poptotal popdensity popwhite popblack popamerindian popasian
##    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>    <int>
##  1 IL    0.052    66090      1271.    63917     1702            98      249
##  2 IL    0.014    10626       759      7054     3496            19       48
##  3 IL    0.022    14991       681.    14477      429            35       16
##  4 IL    0.017    30806      1812.    29344      127            46      150
##  5 IL    0.018     5836       324.     5264      547            14        5
##  6 IL    0.05     35688       714.    35157       50            65      195
##  7 IL    0.017     5322       313.     5298        1             8       15
##  8 IL    0.027    16805       622.    16519      111            30       61
##  9 IL    0.024    13437       560.    13384       16             8       23
## 10 IL    0.058   173025      2983.   146506    16559           331     8033
## # … with 427 more rows, and 17 more variables: popother <int>,
## #   percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>

Note the use of select_if() there. We also drop PID because although it’s numeric, it’s a case identifier, not a measured variable.

Now let’s write a PCA helper function that’s specific to the data we’re working with. As with fit_ols(), it takes some data, df, and then does the thing we want to that data—in this case, fitting a PCA using the prcomp() function.

do_pca <- function(df){
  prcomp(df,
         center = TRUE, scale = TRUE)
}

The center and scale arguments are passed on to prcomp(). PCA results are sensitive to how variables are measured, so it is conventional to center each variable (by subtracting its mean) and scale it (by dividing by its standard deviation). This makes the resulting numerical values more directly comparable.
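
Concretely, centering and scaling are just the familiar standardization. Here is a minimal sketch by hand, using one column of midwest for illustration:

x <- midwest$poptotal
x_std <- (x - mean(x)) / sd(x)  # what center = TRUE, scale = TRUE does to each column
mean(x_std)                     # effectively zero (up to floating point error)
sd(x_std)                       # exactly 1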

As before we could do a PCA on the whole dataset:

out_pca <- mw_pca %>%
    ungroup() %>%
    select(-state) %>%
    do_pca()

If you print the results of a PCA analysis to the console, you will see a square table of numbers. The rows of this table will have the same names as the columns from our original data, i.e., the variables. The columns are the orthogonal principal components, named PC1 to PC24 in this case. (Each column is an eigenvector.) There will be as many components as variables. Ideally, the first few components will “explain” most of the variation in the data, and the way that the original variables are associated with each component will have some sort of substantively plausible interpretation.

Here’s a peek at the components for the whole dataset:

out_pca
## Standard deviations (1, .., p=24):
##  [1] 3.098601e+00 2.209601e+00 1.649472e+00 1.192893e+00 1.121586e+00
##  [6] 8.977640e-01 8.859441e-01 8.194829e-01 6.921234e-01 5.649742e-01
## [11] 5.439431e-01 4.854143e-01 3.799956e-01 3.583348e-01 3.094795e-01
## [16] 2.500930e-01 2.087861e-01 1.924391e-01 9.654481e-02 3.473006e-02
## [21] 1.328238e-02 3.862376e-03 2.886151e-09 5.193302e-16
## 
## Rotation (n x k) = (24 x 24):
##                               PC1          PC2          PC3         PC4
## area                 -0.026194151  0.014996226 -0.103615981  0.25080525
## poptotal             -0.312990340  0.051517565  0.110772033  0.05437383
## popdensity           -0.292047651  0.023597548  0.045576790 -0.10421671
## popwhite             -0.314102501  0.023751463  0.081172077  0.02779283
## popblack             -0.287733756  0.107425973  0.149017651  0.05977198
## popamerindian        -0.269188348  0.091620270 -0.012129451 -0.12363778
## popasian             -0.284059460  0.057769639  0.145218465  0.21322836
## popother             -0.258309265  0.080662808  0.196512506  0.21628387
## percwhite             0.182514833 -0.175573945  0.254864514  0.43714304
## percblack            -0.198800629  0.111941861 -0.153186027 -0.25421194
## percamerindan        -0.001534117  0.174540376 -0.184268646 -0.40652737
## percasian            -0.191218173 -0.127231985 -0.332956738  0.20775316
## percother            -0.173671168 -0.050556657  0.030173428 -0.09437040
## popadults            -0.312119508  0.052857188  0.116377419  0.05450157
## perchsd              -0.094221446 -0.355863355 -0.181247085 -0.04333187
## percollege           -0.153071342 -0.255449911 -0.344841584  0.08121518
## percprof             -0.143077470 -0.206478282 -0.392783145  0.16013389
## poppovertyknown      -0.312587935  0.052236356  0.114220833  0.05273857
## percpovertyknown      0.016028294  0.007402401  0.359748016 -0.34952458
## percbelowpoverty      0.021855622  0.402909415 -0.237353423  0.08207085
## percchildbelowpovert  0.009476823  0.412110822 -0.150115286 -0.04124745
## percadultpoverty      0.014088458  0.358988477 -0.309021903  0.15457501
## percelderlypoverty    0.085358935  0.368070961  0.007658326  0.18553385
## inmetro              -0.137665882 -0.192436724 -0.125070989 -0.33100508
##                                PC5          PC6          PC7           PC8
## area                 -0.6268042152  0.486893708 -0.453569767  0.1079319668
## poptotal              0.0024970161  0.040578465  0.048861233 -0.0285013319
## popdensity            0.1390525473  0.151857172  0.069362791 -0.0664995961
## popwhite              0.0163175123  0.072589923  0.056167495 -0.0008575316
## popblack              0.0061929483  0.048796558  0.016831570 -0.1220243486
## popamerindian        -0.1960220830  0.141355661  0.052576144 -0.1347051621
## popasian             -0.0833969809 -0.162859041  0.128790221  0.0472430208
## popother             -0.1104914802 -0.262246435  0.039072780  0.0526606531
## percwhite             0.0823408813  0.075888521  0.129995786  0.1597830701
## percblack             0.3439790573  0.252961702 -0.383051501 -0.1771722325
## percamerindan        -0.5220486545 -0.288345089  0.324526942 -0.1461253649
## percasian             0.0575491870 -0.063473221  0.116589841  0.1804793473
## percother            -0.0218809886 -0.583801828 -0.595802991  0.3901700486
## popadults             0.0007915297  0.041900620  0.052095520 -0.0298336006
## perchsd              -0.1857555151  0.036568047  0.041004313  0.0396282112
## percollege           -0.0243188162  0.082166160  0.122560453  0.1811713006
## percprof              0.1159894026 -0.039804454  0.193789766  0.1010791912
## poppovertyknown       0.0015598333  0.041736565  0.049619713 -0.0273696954
## percpovertyknown     -0.0787444895  0.257253963  0.237599189  0.6826519171
## percbelowpoverty      0.0444121328 -0.003257345  0.053924973  0.1821190681
## percchildbelowpovert  0.0512103005  0.071824098 -0.005100555  0.2244817257
## percadultpoverty      0.0356366301 -0.037897304  0.076111737  0.1793070629
## percelderlypoverty    0.1419281187  0.082819310  0.024862129  0.0742401752
## inmetro               0.2183862033  0.157288149 -0.003008790  0.2202521313
##                              PC9         PC10        PC11        PC12
## area                  0.12663712 -0.094179408  0.11231535 -0.04313244
## poptotal              0.01552267  0.008775679 -0.01791350 -0.01501762
## popdensity           -0.13895834  0.342485375  0.17787235  0.05949232
## popwhite              0.01271539  0.052449201  0.06560517 -0.03436697
## popblack             -0.02692551 -0.004457095 -0.17937034  0.06297394
## popamerindian        -0.04109829  0.449228928  0.10953017  0.37974128
## popasian              0.10107752 -0.237369140 -0.02627899 -0.13675147
## popother              0.17717451 -0.305645557 -0.22440673 -0.08627011
## percwhite             0.13213941  0.284848619 -0.05544924  0.09470116
## percblack            -0.21602790 -0.307159882 -0.10823698 -0.15912937
## percamerindan         0.06453843 -0.127487786  0.11470316  0.04386861
## percasian            -0.07144517 -0.017682248  0.59303988 -0.36123161
## percother            -0.08942021  0.176666108  0.06711022  0.20749609
## popadults             0.01640764  0.005443480 -0.02055036 -0.02058595
## perchsd              -0.14299416  0.021982405 -0.54219851  0.08183392
## percollege           -0.18490572 -0.085836394 -0.15416795  0.21810997
## percprof             -0.12699026 -0.132671989 -0.01548882  0.17960053
## poppovertyknown       0.01462687  0.008363739 -0.01921287 -0.01472497
## percpovertyknown     -0.29522179 -0.137890254  0.02944298 -0.06807352
## percbelowpoverty      0.04715694  0.110133227 -0.14388516 -0.02624749
## percchildbelowpovert  0.03941614  0.196088069 -0.22772663 -0.22516992
## percadultpoverty      0.07247670  0.200967060 -0.19297958 -0.07394945
## percelderlypoverty   -0.04984767 -0.397177629  0.20448090  0.66565555
## inmetro               0.81787338 -0.053569239  0.02223327  0.13178465
##                               PC13         PC14         PC15         PC16
## area                  0.1492947689  0.049251461 -0.041581191  0.069182502
## poptotal              0.0809142843  0.094209562 -0.047035217 -0.019016740
## popdensity            0.1514018019  0.470174170  0.073725534  0.137872820
## popwhite              0.1025918501  0.227604890  0.155434954  0.125476506
## popblack              0.0891663464 -0.164520239 -0.647266067 -0.421509437
## popamerindian        -0.3335273686 -0.557968501  0.174302448  0.088617583
## popasian             -0.1235609452 -0.018281950  0.236492901  0.127410659
## popother             -0.1072405390 -0.206522022  0.192288737  0.064863701
## percwhite             0.0161116446  0.005368624  0.006960275 -0.006191143
## percblack            -0.0924593171 -0.140982560  0.064505197  0.054331701
## percamerindan         0.1254641220  0.143249174 -0.050450016 -0.034625963
## percasian            -0.4171948840 -0.014102307 -0.173226241 -0.114670694
## percother             0.0638451377  0.053806551 -0.051329033 -0.007277167
## popadults             0.0847188440  0.111948754 -0.027655022 -0.007651811
## perchsd              -0.5158899034  0.327909623 -0.181385662  0.230728906
## percollege            0.1647438288  0.017464859  0.418025355 -0.640037866
## percprof              0.4727125977 -0.293467485 -0.160774812  0.517382126
## poppovertyknown       0.0797510183  0.095501068 -0.047804340 -0.023710576
## percpovertyknown     -0.0000828557 -0.130539767 -0.099251884  0.047437659
## percbelowpoverty     -0.0261022587  0.042007626 -0.053588709  0.006592897
## percchildbelowpovert  0.0056653102  0.002253970  0.283931025  0.030674746
## percadultpoverty      0.0043048799  0.013080016 -0.225560922 -0.014777455
## percelderlypoverty   -0.2378679877  0.237106801 -0.030680141  0.038169302
## inmetro              -0.0385911067  0.011116430 -0.055901908 -0.025376419
##                              PC17        PC18          PC19          PC20
## area                 -0.080053175 -0.03173620  0.0086713439  0.0046807825
## poptotal              0.222846757  0.08703188 -0.0761931856  0.0230540916
## popdensity           -0.574012367 -0.28222971 -0.0054579629  0.0117038685
## popwhite              0.466363870  0.07768886 -0.1407278107  0.0302968726
## popblack             -0.255732749  0.19708095  0.1048230953  0.0086453550
## popamerindian         0.042838025 -0.03869656  0.0216649073  0.0046475411
## popasian             -0.097050949 -0.03676084  0.7810868830  0.0340203731
## popother             -0.315190248 -0.25802711 -0.5546364787 -0.0108903942
## percwhite            -0.006903346  0.04869608 -0.0073086312  0.0155809463
## percblack             0.025791904 -0.12477880  0.0346139218  0.0044765970
## percamerindan        -0.016219619  0.04163721 -0.0191989284 -0.0268487279
## percasian            -0.051979291  0.10724876 -0.1273361926 -0.0157805258
## percother             0.027054066  0.04714364  0.0490008640 -0.0020609153
## popadults             0.225752151  0.08733846 -0.0768707766 -0.0917147487
## perchsd              -0.001809384  0.09493895 -0.0150551778  0.0018112905
## percollege           -0.002842201 -0.07361662  0.0195221284  0.0079930821
## percprof             -0.080014048  0.13539216 -0.0053316027 -0.0068005627
## poppovertyknown       0.226822314  0.08897499 -0.0690617608  0.0035532809
## percpovertyknown      0.019687511 -0.10975064  0.0003994635  0.0008649907
## percbelowpoverty      0.046481431 -0.06957710 -0.0174105405  0.8250598558
## percchildbelowpovert -0.236560793  0.62012162 -0.0515270073 -0.2814040372
## percadultpoverty      0.216599873 -0.54943154  0.0903702718 -0.4610369991
## percelderlypoverty   -0.023343777  0.11150431 -0.0369451797 -0.1215578702
## inmetro              -0.041736731 -0.00287405  0.0326968965  0.0008015254
##                               PC21          PC22          PC23
## area                  0.0006100954  8.367129e-04  9.258598e-11
## poptotal              0.2623663997 -2.695286e-01  3.557253e-09
## popdensity            0.0067130253  3.678380e-03  2.870400e-10
## popwhite              0.3418239111 -3.428497e-01 -9.776444e-09
## popblack              0.1136442033 -1.399848e-01 -4.676845e-09
## popamerindian        -0.0078097728 -2.075393e-03 -1.201757e-10
## popasian              0.0182710315 -1.946581e-02 -3.914513e-10
## popother              0.0354980324 -2.634532e-02  7.874338e-11
## percwhite            -0.0010722219 -1.130704e-04  7.149193e-01
## percblack            -0.0013967824  3.417253e-04  5.180564e-01
## percamerindan         0.0048244342 -2.217079e-04  4.575489e-01
## percasian            -0.0055630731  8.872089e-04  6.333343e-02
## percother            -0.0043170883 -6.030141e-04  8.453324e-02
## popadults            -0.8790319357 -1.030783e-01 -1.770633e-08
## perchsd               0.0021093628 -1.113264e-04  2.265936e-10
## percollege           -0.0019738770 -1.664292e-03 -7.269330e-11
## percprof             -0.0020900394  3.814056e-03 -1.656475e-10
## poppovertyknown       0.1303632718  8.822909e-01  2.864555e-08
## percpovertyknown     -0.0012238472 -2.699457e-03 -2.291530e-10
## percbelowpoverty     -0.0830583366  4.745691e-03  4.430128e-09
## percchildbelowpovert  0.0335261456 -7.310973e-04 -8.287914e-10
## percadultpoverty      0.0416354842 -2.949434e-03 -2.967820e-09
## percelderlypoverty    0.0126368821 -1.055024e-03 -6.503165e-10
## inmetro              -0.0026678800 -5.150479e-05 -9.243588e-11
##                               PC24
## area                 -1.363506e-16
## poptotal              8.095608e-01
## popdensity           -1.526557e-16
## popwhite             -5.435525e-01
## popblack             -2.143790e-01
## popamerindian        -2.359217e-03
## popasian             -2.584333e-02
## popother             -5.030128e-02
## percwhite            -6.579316e-09
## percblack            -4.767611e-09
## percamerindan        -4.210767e-09
## percasian            -5.828499e-10
## percother            -7.779492e-10
## popadults             8.743006e-16
## perchsd               2.775558e-17
## percollege            8.326673e-17
## percprof             -4.857226e-17
## poppovertyknown       6.508682e-15
## percpovertyknown      1.665335e-16
## percbelowpoverty     -1.387779e-16
## percchildbelowpovert -1.526557e-16
## percadultpoverty      4.718448e-16
## percelderlypoverty    6.418477e-17
## inmetro              -2.775558e-16

You can also get a summary of the components:

summary(out_pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     3.0986 2.2096 1.6495 1.19289 1.12159 0.89776 0.8859
## Proportion of Variance 0.4001 0.2034 0.1134 0.05929 0.05241 0.03358 0.0327
## Cumulative Proportion  0.4001 0.6035 0.7168 0.77614 0.82856 0.86214 0.8948
##                            PC8     PC9   PC10    PC11    PC12    PC13
## Standard deviation     0.81948 0.69212 0.5650 0.54394 0.48541 0.38000
## Proportion of Variance 0.02798 0.01996 0.0133 0.01233 0.00982 0.00602
## Cumulative Proportion  0.92283 0.94278 0.9561 0.96841 0.97823 0.98425
##                           PC14    PC15    PC16    PC17    PC18    PC19
## Standard deviation     0.35833 0.30948 0.25009 0.20879 0.19244 0.09654
## Proportion of Variance 0.00535 0.00399 0.00261 0.00182 0.00154 0.00039
## Cumulative Proportion  0.98960 0.99359 0.99619 0.99801 0.99955 0.99994
##                           PC20    PC21     PC22      PC23      PC24
## Standard deviation     0.03473 0.01328 0.003862 2.886e-09 5.193e-16
## Proportion of Variance 0.00005 0.00001 0.000000 0.000e+00 0.000e+00
## Cumulative Proportion  0.99999 1.00000 1.000000 1.000e+00 1.000e+00

There’s a broom method for PCA, so we can tidy the results. The tidy() function can summarize a prcomp object in several ways; here we use the matrix argument to ask for information about the components and their share of the variance.

tidy_pca <- tidy(out_pca, matrix = "pcs")

tidy_pca
## # A tibble: 24 x 4
##       PC std.dev percent cumulative
##    <dbl>   <dbl>   <dbl>      <dbl>
##  1     1   3.10   0.400       0.400
##  2     2   2.21   0.203       0.603
##  3     3   1.65   0.113       0.717
##  4     4   1.19   0.0593      0.776
##  5     5   1.12   0.0524      0.829
##  6     6   0.898  0.0336      0.862
##  7     7   0.886  0.0327      0.895
##  8     8   0.819  0.0280      0.923
##  9     9   0.692  0.0200      0.943
## 10    10   0.565  0.0133      0.956
## # … with 14 more rows

You can see from the percent and cumulative columns that the first principal component accounts for 40% of the variance. The second accounts for about 20% by itself and 60% cumulatively, and so on. By the fourth component we’re up to 77 percent accounted for. Note again that although it’s conventional to say that the components “explain” the variance in the variables, this is something that’s mathematically guaranteed by the way that the calculation happens. Whether this purely formal sense of “explanation” translates into something more substantively explanatory is a separate question.
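
As a check on that reading, the cumulative column is just the running sum of percent, which we can reproduce with cumsum(). A quick sketch:

tidy_pca %>%
    mutate(cumulative_check = cumsum(percent)) %>%  # should match cumulative
    select(PC, percent, cumulative, cumulative_check)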

Now we can make a “scree plot”, showing the relative importance of the components. Ideally we’d like the first four or so to account for almost all the variance:

tidy_pca %>%
    ggplot(aes(x = PC, y = percent)) +
    geom_line() +
    labs(x = "Principal Component", y = "Variance Explained") 

Not bad.

PCA on the Midwest data, grouped by state

Now, let’s say that instead of doing a PCA on the whole dataset at once, we want to do it within each state. This is where our split-apply-combine approach comes in. First we take our mw_pca data and nest it within states:

mw_pca <- mw_pca %>%
    group_by(state) %>%
    nest()

mw_pca
## # A tibble: 5 x 2
## # Groups:   state [5]
##   state            data
##   <chr> <list<df[,24]>>
## 1 IL         [102 × 24]
## 2 IN          [92 × 24]
## 3 MI          [83 × 24]
## 4 OH          [88 × 24]
## 5 WI          [72 × 24]

Now we can do the PCA by group (i.e. by state) and tidy the results:

state_pca <- mw_pca %>% 
    mutate(pca = map(data, do_pca))

state_pca
## # A tibble: 5 x 3
## # Groups:   state [5]
##   state            data pca     
##   <chr> <list<df[,24]>> <list>  
## 1 IL         [102 × 24] <prcomp>
## 2 IN          [92 × 24] <prcomp>
## 3 MI          [83 × 24] <prcomp>
## 4 OH          [88 × 24] <prcomp>
## 5 WI          [72 × 24] <prcomp>

This gives us a new list column, pca, each row of which is an object that contains all the results of running prcomp(). We can add a second list column with the tidied summary. Again we’ll write a helper function to make what we’re doing a little more legible.

do_tidy <- function(pr){
    broom::tidy(pr, matrix = "pcs")
}
state_pca  <- mw_pca %>%
    mutate(pca = map(data, do_pca),
           pcs = map(pca, do_tidy)) 

state_pca
## # A tibble: 5 x 4
## # Groups:   state [5]
##   state            data pca      pcs              
##   <chr> <list<df[,24]>> <list>   <list>           
## 1 IL         [102 × 24] <prcomp> <tibble [24 × 4]>
## 2 IN          [92 × 24] <prcomp> <tibble [24 × 4]>
## 3 MI          [83 × 24] <prcomp> <tibble [24 × 4]>
## 4 OH          [88 × 24] <prcomp> <tibble [24 × 4]>
## 5 WI          [72 × 24] <prcomp> <tibble [24 × 4]>

The pcs list column contains the tidied summary of the PCA. We can unnest it, and draw a graph like before, only with state-level grouping:

state_pca %>%
    unnest(cols = c(pcs)) %>%
    ggplot(aes(x = PC, y = percent)) +
    geom_line(size = 1.1) +
    facet_wrap(~ state, nrow = 1) +
    labs(x = "Principal Component",
         y = "Variance Explained") 

We can also use the tools that broom gives us to see where the original data points (the counties) fall in the space created by the PCA. For this we’ll use broom’s augment() function. Augment returns tidy information at the level of the original observations (in this case, the counties in the midwest data). Again, a helper function for clarity.

do_aug <- function(pr){
    broom::augment(pr)
}

Let’s just recreate the whole object with the augmented data there as a third list column:

state_pca  <- mw_pca %>%
    mutate(pca = map(data, do_pca),
           pcs = map(pca, do_tidy),
           fitted = map(pca, do_aug)) 

state_pca
## # A tibble: 5 x 5
## # Groups:   state [5]
##   state            data pca      pcs               fitted             
##   <chr> <list<df[,24]>> <list>   <list>            <list>             
## 1 IL         [102 × 24] <prcomp> <tibble [24 × 4]> <tibble [102 × 25]>
## 2 IN          [92 × 24] <prcomp> <tibble [24 × 4]> <tibble [92 × 25]> 
## 3 MI          [83 × 24] <prcomp> <tibble [24 × 4]> <tibble [83 × 25]> 
## 4 OH          [88 × 24] <prcomp> <tibble [24 × 4]> <tibble [88 × 25]> 
## 5 WI          [72 × 24] <prcomp> <tibble [24 × 4]> <tibble [72 × 25]>

You can see that the tibbles in the fitted list column all have 25 columns (a .rownames identifier plus each county’s score on each of the 24 components), and a varying number of rows. The number of rows is the number of counties in that state.
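
If you’d rather check those dimensions programmatically than read them off the printout, purrr’s map_int() works on list columns too. A sketch:

state_pca %>%
    mutate(n_counties = map_int(fitted, nrow),   # rows per state's augmented tibble
           n_cols     = map_int(fitted, ncol)) %>%
    select(state, n_counties, n_cols)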

Now we plot the counties projected onto the first two principal components, faceted by state. We facet the graph because we ran the PCA separately for each state.

state_pca %>%
    unnest(cols = c(fitted)) %>%
    ggplot(aes(x = .fittedPC1,
               y = .fittedPC2)) +
    geom_point() +
    facet_wrap(~ state) + 
    labs(x = "First Principal Component", 
         y = "Second Principal Component") 

It looks like the counties within each state cluster very strongly on the first component (the x-axis), with a couple of outlying counties in each case. Most of the visible variation is on the second component (the y-axis). Just for the sake of it, because now I’m curious about what those outlying counties are, let’s redo all the steps from the start, this time also holding on to the county names so we can use them in the plot. Here we go, all in one breath from the very beginning:

out <- midwest %>%
    group_by(state) %>%
    select_if(is.numeric) %>%
    select(-PID) %>%
    nest() %>%
    mutate(pca = map(data, do_pca),
           pcs = map(pca, do_tidy),
           fitted = map(pca, do_aug)) %>%
    unnest(cols = c(fitted)) %>%
    add_column(county = midwest$county) 


ggplot(data = out, aes(x = .fittedPC1,
               y = .fittedPC2,
               label = county)) +
    geom_text(size = 1.1) +
    labs(x = "First Principal Component", 
         y = "Second Principal Component") +
    theme_minimal() + facet_wrap(~ state, ncol = 2) 

We can do that last add_column(county = midwest$county) step because we know we’ve ended with a tibble where the rows are the same entities (and in the same order) as the original midwest dataset.
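
If you want some reassurance that the rows really are in the original order before relying on add_column() like this, a quick sanity check is cheap. A sketch:

stopifnot(nrow(out) == nrow(midwest))  # same number of rows
all(out$state == midwest$state)        # should be TRUE if the order is preserved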

You can see that in some cases the big outliers along the x-axis (the first component) are very highly populated counties. E.g. Cook County, IL, contains the city of Chicago; Marion County, IN, contains Indianapolis; and so on. Meanwhile, on the y-axis (the second component), looking at Illinois we can see DuPage County at one end, a well-to-do exurb of Chicago where Wheaton is located. And at the other end is Alexander County, the southernmost county in Illinois, with a relatively small population (about 8,000 people). Compare the characterizations of Alexander County and DuPage County to get a sense of why the PCA is putting them far apart from one another on the second component.

We can also look at the second and third components. Ideally the interpretation here is something like “Accounting for or excluding or apart from everything that the first component is picking up, how do counties vary or cluster on the next two orthogonal dimensions identified by the PCA?”

ggplot(data = out, aes(x = .fittedPC2,
               y = .fittedPC3,
               label = county)) +
    geom_text(size = 1.1) +
    labs(x = "Second Principal Component", 
         y = "Third Principal Component") +
    theme_minimal() + facet_wrap(~ state, ncol = 2) 

Using map and mutate to recode data

When we were working with the Congressional term data, we wanted to ask some questions about freshman classes of representatives since 1945. To get this right we must first restrict ourselves to people in the dataset with a start date on or after the first Congress we observe, the 79th. Then we need to get each unique individual’s first term. CQ has no term_id so we have to make one. One way to do this is by nesting the data by unique individual (the pid variable) and then adding a column inside each nested person-level data frame giving a per-person sequence of terms served. This way of making a term_id results in a sequence from 1 to n terms for each individual. Note that this isn’t a proper temporal variable as it does not capture gaps or breaks in the sequence of terms, and so on. Given that we only want to look at first terms now, this is acceptable.

We create the term_id sequences per pid like this. First let’s again make a helper function.

make_termid <- function(data){
    data %>%
        mutate(term_id = 1 + congress - first(congress))
}

first_terms <- congress %>%
    filter(position == "U.S. Representative",
           start > "1945-01-01") %>%
    group_by(pid) %>%
    nest() %>%
    mutate(data = map(data, make_termid)) %>%
    unnest(cols = c(data)) %>%
    filter(term_id == 1) %>%
    select(pid, term_id, everything())

first_terms
## # A tibble: 3,093 x 33
## # Groups:   pid [3,009]
##      pid term_id congress last  first middle suffix born       death     
##    <dbl>   <dbl>    <dbl> <chr> <chr> <chr>  <chr>  <date>     <date>    
##  1     1       1       79 Aber… Thom… Gerst… <NA>   1903-05-16 1953-01-23
##  2     2       1       79 Adams Sher… <NA>   <NA>   1899-01-08 1986-10-27
##  3     4       1       79 Allen Asa   Leona… <NA>   1891-01-05 1969-01-05
##  4     5       1       79 Allen Leo   Elwood <NA>   1898-10-05 1973-01-19
##  5     6       1       79 Almo… J.    Linds… Jr.    1898-06-15 1986-04-14
##  6     7       1       79 Ande… Herm… Carl   <NA>   1897-01-27 1978-07-26
##  7     9       1       79 Ande… John  Zuing… <NA>   1904-03-22 1981-02-09
##  8    10       1       79 Andr… Augu… Herman <NA>   1890-10-11 1958-01-14
##  9    13       1       79 Andr… Walt… Gresh… <NA>   1889-07-16 1949-03-05
## 10    14       1       79 Ange… Homer Daniel <NA>   1875-01-12 1968-03-31
## # … with 3,083 more rows, and 24 more variables: sex <chr>,
## #   position <chr>, party <chr>, state <chr>, district <chr>,
## #   start <date>, end <chr>, religion <chr>, race <chr>, education <chr>,
## #   occ1 <chr>, occ2 <chr>, occ3 <chr>, mil1 <chr>, mil2 <chr>,
## #   mil3 <chr>, start_year <date>, end_year <date>, start_age <int>,
## #   poc <chr>, days_old <dbl>, entry_age <int>, end_career <date>,
## #   yr_fac <fct>
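
With first_terms in hand, questions about freshman classes reduce to ordinary grouped counting again. For example, a sketch of class sizes by Congress (we ungroup first, since the table is still grouped by pid):

first_terms %>%
    ungroup() %>%
    count(congress)  # number of first-term Representatives in each Congress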

Map functions can be written several ways

In the examples above, we wrote a helper function like do_pca() or make_termid() so as to make it clear what was happening when map() took a do_something() function and fed it to each element of our nested data. This isn’t the only way to make use of map(). The “do something” part can be written more than one way. As you become more comfortable with map() you will probably make use of the alternatives, because they are more compact. In effect they amount to writing the “do something” part directly inside the map() function. Here’s an example. Before, we wrote this:

out  <- mw_pca %>%
    mutate(pca = map(data, do_pca))

out
## # A tibble: 5 x 3
## # Groups:   state [5]
##   state            data pca     
##   <chr> <list<df[,24]>> <list>  
## 1 IL         [102 × 24] <prcomp>
## 2 IN          [92 × 24] <prcomp>
## 3 MI          [83 × 24] <prcomp>
## 4 OH          [88 × 24] <prcomp>
## 5 WI          [72 × 24] <prcomp>

This presupposed us writing the do_pca() function as well. But we could also have used R’s formula syntax (using ~) to stuff the contents of do_pca() right in the map() call. This can be worthwhile when the do_something() you want is a function that already exists, as is the case with prcomp.

state_pca <- mw_pca %>% 
    mutate(pca = map(data, ~ prcomp(.x, center = TRUE, scale = TRUE)))

state_pca
## # A tibble: 5 x 3
## # Groups:   state [5]
##   state            data pca     
##   <chr> <list<df[,24]>> <list>  
## 1 IL         [102 × 24] <prcomp>
## 2 IN          [92 × 24] <prcomp>
## 3 MI          [83 × 24] <prcomp>
## 4 OH          [88 × 24] <prcomp>
## 5 WI          [72 × 24] <prcomp>

Notice the use of .x here. The tilde (~) starts a one-sided formula that map() converts into a function, and inside that formula .x is a pronoun standing for the current element of the list being mapped over, in this case each state’s mini-tibble from the data column. The prcomp() function is part of base R. It doesn’t know it’s in a tidy pipeline, so we have to tell it explicitly which data it’s going to be using. Hence, .x.
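
If the formula notation feels opaque, note that it is just shorthand for an anonymous function. This longer version is equivalent:

state_pca <- mw_pca %>% 
    mutate(pca = map(data, function(df) {
        prcomp(df, center = TRUE, scale = TRUE)  # df plays the role of .x
    }))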