Example 05: Working with `dplyr`

Setup

Code

library(here)      # manage file paths

here() starts at /Users/kjhealy/Documents/courses/vsd

Code

library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Core `dplyr` verbs

Code

gss_sm

# A tibble: 2,867 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Select columns

Code

gss_sm |> 
  select(age, degree, bigregion, religion)

# A tibble: 2,867 × 4
     age degree         bigregion religion  
   <dbl> <fct>          <fct>     <fct>     
 1    47 Bachelor       Northeast None      
 2    61 High School    Northeast None      
 3    72 Bachelor       Northeast Catholic  
 4    43 High School    Northeast Catholic  
 5    55 Graduate       Northeast None      
 6    53 Junior College Northeast None      
 7    50 High School    Northeast None      
 8    23 High School    Northeast Catholic  
 9    45 High School    Northeast Protestant
10    71 Junior College Northeast None      
# ℹ 2,857 more rows

Filter rows

Code

gss_sm |> 
  filter(age > 45)

# A tibble: 1,612 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 5  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 6  2016     7 1             50      2 2      High … White Male  New E… $170000…
 7  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
 8  2016    12 1             86      4 4      High … White Fema… Middl… under $…
 9  2016    14 3             60      5 6      High … Black Fema… Middl… $12500 …
10  2016    15 2             76      7 0      Lt Hi… White Male  New E… $40000 …
# ℹ 1,602 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Code

gss_sm |> 
  filter(childs > 4 & race == "White")

# A tibble: 110 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016    15 2             76      7 0      Lt Hi… White Male  New E… $40000 …
 2  2016    17 3             56      6 3      High … White Male  New E… $50000 …
 3  2016    26 2             76      8 7      Lt Hi… White Fema… Middl… $5 000 …
 4  2016   142 3             65      5 2      Junio… White Fema… New E… <NA>    
 5  2016   177 1             56      5 3      Bache… White Male  Pacif… $130000…
 6  2016   190 2             51      7 9      Lt Hi… White Fema… Pacif… $15000 …
 7  2016   216 3             77      8 9      High … White Male  Pacif… $60000 …
 8  2016   351 3             52      5 4      High … White Fema… E. No… $35000 …
 9  2016   365 1             51      5 5      Gradu… White Male  South… $170000…
10  2016   379 3             NA      7 2      High … White Male  South… $170000…
# ℹ 100 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Logically Group with `group_by()`

Code

gss_sm |> 
  group_by(bigregion)

# A tibble: 2,867 × 32
# Groups:   bigregion [4]
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Summarize groups with `summarize()`

Code

gss_sm |> 
  group_by(bigregion) |>  
  summarize(total = n())

# A tibble: 4 × 2
  bigregion total
  <fct>     <int>
1 Northeast   488
2 Midwest     695
3 South      1052
4 West        632

Multi-way groupings

Code

gss_sm |>  
  group_by(bigregion, religion) |> 
  summarize(total = n())

`summarise()` has grouped output by 'bigregion'. You can override using the
`.groups` argument.

# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion   total
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows

Add columns with `mutate()`

Code

gss_sm |>  
  group_by(bigregion, religion) |> 
  summarize(total = n()) |> 
  mutate(freq = total / sum(total),
         pct = round((freq*100), 1))

`summarise()` has grouped output by 'bigregion'. You can override using the
`.groups` argument.

# A tibble: 24 × 5
# Groups:   bigregion [4]
   bigregion religion   total    freq   pct
   <fct>     <fct>      <int>   <dbl> <dbl>
 1 Northeast Protestant   158 0.324    32.4
 2 Northeast Catholic     162 0.332    33.2
 3 Northeast Jewish        27 0.0553    5.5
 4 Northeast None         112 0.230    23  
 5 Northeast Other         28 0.0574    5.7
 6 Northeast <NA>           1 0.00205   0.2
 7 Midwest   Protestant   325 0.468    46.8
 8 Midwest   Catholic     172 0.247    24.7
 9 Midwest   Jewish         3 0.00432   0.4
10 Midwest   None         157 0.226    22.6
# ℹ 14 more rows

Tally and Count

Do it yourself:

Code

gss_sm |> 
  group_by(bigregion, religion) |> 
  summarize(n = n())

`summarise()` has grouped output by 'bigregion'. You can override using the
`.groups` argument.

# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows

Use tally():

Code

gss_sm |> 
  group_by(bigregion, religion) |> 
  tally()

# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows

Use count():

Code

gss_sm |> 
  count(bigregion, religion)

# A tibble: 24 × 3
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows

Pay attention to how grouping works in these summaries.
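As a rough illustration of the point about grouping, the sketch below runs the three approaches on a small made-up table and asks each result which grouping variables it still carries. The `toy` data and its values are invented for this example and are not part of `gss_sm`.

```r
library(dplyr)

# Made-up data: 2 regions x 2 religions (values invented for illustration)
toy <- tibble(
  region   = c("North", "North", "North", "South", "South"),
  religion = c("A", "A", "B", "A", "B")
)

# summarize() drops the last grouping level; stated explicitly here
by_summarize <- toy |>
  group_by(region, religion) |>
  summarize(n = n(), .groups = "drop_last")

# tally() on grouped data behaves the same way
by_tally <- toy |>
  group_by(region, religion) |>
  tally()

# count() on ungrouped data returns an ungrouped result
by_count <- toy |>
  count(region, religion)

group_vars(by_summarize)  # still grouped by region
group_vars(by_tally)      # still grouped by region
group_vars(by_count)      # no grouping variables
```

The counts are identical in all three results; only the leftover grouping differs, and that is what changes the behavior of any `mutate()` that follows.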

Check your work

Code

rel_by_region <- gss_sm |> 
  count(bigregion, religion) |> 
  mutate(pct = round((n/sum(n))*100, 1)) 

rel_by_region

# A tibble: 24 × 4
   bigregion religion       n   pct
   <fct>     <fct>      <int> <dbl>
 1 Northeast Protestant   158   5.5
 2 Northeast Catholic     162   5.7
 3 Northeast Jewish        27   0.9
 4 Northeast None         112   3.9
 5 Northeast Other         28   1  
 6 Northeast <NA>           1   0  
 7 Midwest   Protestant   325  11.3
 8 Midwest   Catholic     172   6  
 9 Midwest   Jewish         3   0.1
10 Midwest   None         157   5.5
# ℹ 14 more rows

Each region should sum to ~100

Code

rel_by_region |> 
  group_by(bigregion) |> 
  summarize(total = sum(pct))

# A tibble: 4 × 2
  bigregion total
  <fct>     <dbl>
1 Northeast  17  
2 Midwest    24.3
3 South      36.7
4 West       22

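Grouping has caught us out. count() returns an ungrouped tibble, so sum(n) is the total for the whole table and each percentage is a share of all 2,867 respondents rather than a share of its region. Try again, grouping first so that the denominator is computed within each region:

Code

rel_by_region <- gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) 

rel_by_region

# A tibble: 24 × 4
# Groups:   bigregion [4]
   bigregion religion       n   pct
   <fct>     <fct>      <int> <dbl>
 1 Northeast Protestant   158  32.4
 2 Northeast Catholic     162  33.2
 3 Northeast Jewish        27   5.5
 4 Northeast None         112  23  
 5 Northeast Other         28   5.7
 6 Northeast <NA>           1   0.2
 7 Midwest   Protestant   325  46.8
 8 Midwest   Catholic     172  24.7
 9 Midwest   Jewish         3   0.4
10 Midwest   None         157  22.6
# ℹ 14 more rows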
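A minimal sketch of the same check on a made-up table may help. It assumes a hypothetical `toy` tibble with invented values, and it keeps count() but adds an explicit group_by() before computing percentages, which is one of several ways to get within-region shares.

```r
library(dplyr)

# Made-up data: 4 respondents in North, 6 in South (invented for illustration)
toy <- tibble(
  region   = c(rep("North", 4), rep("South", 6)),
  religion = c("A", "A", "A", "B", "A", "A", "A", "B", "B", "B")
)

# Ungrouped: sum(n) is the whole-table total, so pct is a share of everyone
overall <- toy |>
  count(region, religion) |>
  mutate(pct = round(n / sum(n) * 100, 1))

# Grouped by region: sum(n) is computed within each region
within <- toy |>
  count(region, religion) |>
  group_by(region) |>
  mutate(pct = round(n / sum(n) * 100, 1)) |>
  ungroup()

# Check: percentages should total about 100 within each region
check <- within |>
  group_by(region) |>
  summarize(total = sum(pct))

check
```

In the ungrouped version the percentages total about 100 only across the whole table, which is the pattern seen in the failed check.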

Summarize returns one tibble row per group

Code

gss_sm |> 
  group_by(bigregion) |> 
  tally()

# A tibble: 4 × 2
  bigregion     n
  <fct>     <int>
1 Northeast   488
2 Midwest     695
3 South      1052
4 West        632

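When you group by two or more variables, the calculation is done from the inside out, on the innermost group. summarize() and tally() then drop that innermost grouping level, so the result remains grouped by the outer variable or variables.

Code

# 4 regions x 6 religion categories (including NA) = 24 groups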
gss_sm |> 
  group_by(bigregion, religion) |> 
  tally()

# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows
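The "inside out" behavior can be sketched by summarizing twice in a row. This is an illustration on a hypothetical `toy` tibble with invented values, with `.groups = "drop_last"` written out to make the default explicit.

```r
library(dplyr)

# Made-up data (invented for illustration)
toy <- tibble(
  region   = c("North", "North", "North", "South", "South"),
  religion = c("A", "A", "B", "A", "B")
)

# First pass: one row per region-religion pair; the religion level is peeled off
step1 <- toy |>
  group_by(region, religion) |>
  summarize(n = n(), .groups = "drop_last")

# Second pass: one row per region; the region level is peeled off
step2 <- step1 |>
  summarize(n = sum(n), .groups = "drop_last")

group_vars(step1)  # region is the only grouping variable left
group_vars(step2)  # no grouping variables left
step2
```

Each call to summarize() collapses the current innermost level, so repeated calls walk outward through the grouping variables one at a time.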

Summarize many variables

The inefficient way:

Code

organdata |> 
  group_by(consent_law, country) |> 
  summarize(donors_mean = mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            gdp_sd = sd(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE),
            cerebvas_mean = mean(cerebvas, na.rm = TRUE))

`summarise()` has grouped output by 'consent_law'. You can override using the
`.groups` argument.

# A tibble: 17 × 9
# Groups:   consent_law [2]
   consent_law country        donors_mean donors_sd gdp_mean gdp_sd health_mean
   <chr>       <chr>                <dbl>     <dbl>    <dbl>  <dbl>       <dbl>
 1 Informed    Australia             10.6     1.14    22179.  3959.       1958.
 2 Informed    Canada                14.0     0.751   23711.  3966.       2272.
 3 Informed    Denmark               13.1     1.47    23722.  3896.       2054.
 4 Informed    Germany               13.0     0.611   22163.  2501.       2349.
 5 Informed    Ireland               19.8     2.48    20824.  6670.       1480.
 6 Informed    Netherlands           13.7     1.55    23013.  3770.       1993.
 7 Informed    United Kingdom        13.5     0.775   21359.  3929.       1561.
 8 Informed    United States         20.0     1.33    29212.  4571.       3988.
 9 Presumed    Austria               23.5     2.42    23876.  3343.       1875.
10 Presumed    Belgium               21.9     1.94    22500.  3171.       1958.
11 Presumed    Finland               18.4     1.53    21019.  3668.       1615.
12 Presumed    France                16.8     1.60    22603.  3260.       2160.
13 Presumed    Italy                 11.1     4.28    21554.  2781.       1757 
14 Presumed    Norway                15.4     1.11    26448.  6492.       2217.
15 Presumed    Spain                 28.1     4.96    16933   2888.       1289.
16 Presumed    Sweden                13.1     1.75    22415.  3213.       1951.
17 Presumed    Switzerland           14.2     1.71    27233   2153.       2776.
# ℹ 2 more variables: roads_mean <dbl>, cerebvas_mean <dbl>

Use `across()` and `where()` instead

Better:

Code

organdata |> 
    group_by(consent_law, country) |>
      summarize(across(where(is.numeric),
                       list(mean = \(x) mean(x, na.rm = TRUE), 
                            sd = \(x) sd(x, na.rm = TRUE))))

`summarise()` has grouped output by 'consent_law'. You can override using the
`.groups` argument.

# A tibble: 17 × 28
# Groups:   consent_law [2]
   consent_law country       donors_mean donors_sd pop_mean pop_sd pop_dens_mean
   <chr>       <chr>               <dbl>     <dbl>    <dbl>  <dbl>         <dbl>
 1 Informed    Australia            10.6     1.14    18318. 8.31e2         0.237
 2 Informed    Canada               14.0     0.751   29608. 1.19e3         0.297
 3 Informed    Denmark              13.1     1.47     5257. 8.06e1        12.2  
 4 Informed    Germany              13.0     0.611   80255. 5.16e3        22.5  
 5 Informed    Ireland              19.8     2.48     3674. 1.32e2         5.23 
 6 Informed    Netherlands          13.7     1.55    15548. 3.73e2        37.4  
 7 Informed    United Kingd…        13.5     0.775   58187. 6.26e2        24.0  
 8 Informed    United States        20.0     1.33   269330. 1.25e4         2.80 
 9 Presumed    Austria              23.5     2.42     7927. 1.09e2         9.45 
10 Presumed    Belgium              21.9     1.94    10153. 1.09e2        30.7  
11 Presumed    Finland              18.4     1.53     5112. 6.86e1         1.51 
12 Presumed    France               16.8     1.60    58056. 8.51e2        10.5  
13 Presumed    Italy                11.1     4.28    57360. 4.25e2        19.0  
14 Presumed    Norway               15.4     1.11     4386. 9.73e1         1.35 
15 Presumed    Spain                28.1     4.96    39666. 9.51e2         7.84 
16 Presumed    Sweden               13.1     1.75     8789. 1.14e2         1.95 
17 Presumed    Switzerland          14.2     1.71     7037. 1.70e2        17.0  
# ℹ 21 more variables: pop_dens_sd <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
#   gdp_lag_mean <dbl>, gdp_lag_sd <dbl>, health_mean <dbl>, health_sd <dbl>,
#   health_lag_mean <dbl>, health_lag_sd <dbl>, pubhealth_mean <dbl>,
#   pubhealth_sd <dbl>, roads_mean <dbl>, roads_sd <dbl>, cerebvas_mean <dbl>,
#   cerebvas_sd <dbl>, assault_mean <dbl>, assault_sd <dbl>,
#   external_mean <dbl>, external_sd <dbl>, txp_pop_mean <dbl>,
#   txp_pop_sd <dbl>

The \(x) introduces an anonymous function, or lambda (this shorthand requires R 4.1 or later). The x means “the thing” or “the thing we’re doing something to right now”, which here is each selected column in turn, and what follows it is the operation we perform on that thing.
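To make the lambda notation concrete, here is a small sketch showing that `\(x)` is shorthand for `function(x)`. The names `mean_lambda`, `mean_long`, `v`, and `toy` are invented for this example.

```r
library(dplyr)

# Two equivalent definitions of "mean, ignoring missing values"
mean_lambda <- \(x) mean(x, na.rm = TRUE)          # shorthand form
mean_long   <- function(x) mean(x, na.rm = TRUE)   # the same function, written out

v <- c(1, NA, 3, 5)
mean_lambda(v)
mean_long(v)

# Inside across(), the lambda is applied to each selected column in turn
toy <- tibble(grp = c("a", "a", "b", "b"),
              val = c(1, NA, 3, 5))

res <- toy |>
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

res
```

Writing the function inline is convenient when you need to pass extra arguments such as `na.rm = TRUE` and do not plan to reuse the function elsewhere.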

Optionally drop any remaining groups:

Code

organdata |> 
    group_by(consent_law, country) |>
      summarize(across(where(is.numeric),
                       list(mean = \(x) mean(x, na.rm = TRUE), 
                            sd = \(x) sd(x, na.rm = TRUE))),
                .groups = "drop")

# A tibble: 17 × 28
   consent_law country       donors_mean donors_sd pop_mean pop_sd pop_dens_mean
   <chr>       <chr>               <dbl>     <dbl>    <dbl>  <dbl>         <dbl>
 1 Informed    Australia            10.6     1.14    18318. 8.31e2         0.237
 2 Informed    Canada               14.0     0.751   29608. 1.19e3         0.297
 3 Informed    Denmark              13.1     1.47     5257. 8.06e1        12.2  
 4 Informed    Germany              13.0     0.611   80255. 5.16e3        22.5  
 5 Informed    Ireland              19.8     2.48     3674. 1.32e2         5.23 
 6 Informed    Netherlands          13.7     1.55    15548. 3.73e2        37.4  
 7 Informed    United Kingd…        13.5     0.775   58187. 6.26e2        24.0  
 8 Informed    United States        20.0     1.33   269330. 1.25e4         2.80 
 9 Presumed    Austria              23.5     2.42     7927. 1.09e2         9.45 
10 Presumed    Belgium              21.9     1.94    10153. 1.09e2        30.7  
11 Presumed    Finland              18.4     1.53     5112. 6.86e1         1.51 
12 Presumed    France               16.8     1.60    58056. 8.51e2        10.5  
13 Presumed    Italy                11.1     4.28    57360. 4.25e2        19.0  
14 Presumed    Norway               15.4     1.11     4386. 9.73e1         1.35 
15 Presumed    Spain                28.1     4.96    39666. 9.51e2         7.84 
16 Presumed    Sweden               13.1     1.75     8789. 1.14e2         1.95 
17 Presumed    Switzerland          14.2     1.71     7037. 1.70e2        17.0  
# ℹ 21 more variables: pop_dens_sd <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
#   gdp_lag_mean <dbl>, gdp_lag_sd <dbl>, health_mean <dbl>, health_sd <dbl>,
#   health_lag_mean <dbl>, health_lag_sd <dbl>, pubhealth_mean <dbl>,
#   pubhealth_sd <dbl>, roads_mean <dbl>, roads_sd <dbl>, cerebvas_mean <dbl>,
#   cerebvas_sd <dbl>, assault_mean <dbl>, assault_sd <dbl>,
#   external_mean <dbl>, external_sd <dbl>, txp_pop_mean <dbl>,
#   txp_pop_sd <dbl>
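The following sketch, on a hypothetical four-row `toy` table with invented values, shows two things about the call above: the output columns are named by joining each column name to the names in the function list, and `.groups = "drop"` returns an ungrouped tibble.

```r
library(dplyr)

# Made-up data standing in for organdata (values invented for illustration)
toy <- tibble(
  law     = c("Informed", "Informed", "Presumed", "Presumed"),
  country = c("X", "X", "Y", "Y"),
  donors  = c(10, 12, 20, NA),
  gdp     = c(100, 110, 90, 95)
)

out <- toy |>
  group_by(law, country) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd   = \(x) sd(x, na.rm = TRUE))),
            .groups = "drop")

names(out)       # grouping columns, then <column>_<function> for each pair
group_vars(out)  # empty: no groups remain
```

Dropping the groups at the end is a reasonable habit when the summary table is the finished product, because later steps will then operate on the whole table rather than within leftover groups.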

The across() function is used inside summarize() and mutate() to do something across some subset of columns.
Inside across(), use where() to choose columns, and then apply a function to each of them.

Code

organdata |>
  mutate(across(where(is.numeric), 
         round))

# A tibble: 238 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <dbl>    <dbl> <dbl>   <dbl>  <dbl>      <dbl>
 1 Australia NA             NA 17065        0 16774   16591   1300       1224
 2 Australia 1991-01-01     12 17284        0 17171   16774   1379       1300
 3 Australia 1992-01-01     12 17495        0 17914   17171   1455       1379
 4 Australia 1993-01-01     13 17667        0 18883   17914   1540       1455
 5 Australia 1994-01-01     10 17855        0 19849   18883   1626       1540
 6 Australia 1995-01-01     10 18072        0 21079   19849   1737       1626
 7 Australia 1996-01-01     11 18311        0 21923   21079   1846       1737
 8 Australia 1997-01-01     10 18518        0 22961   21923   1948       1846
 9 Australia 1998-01-01     10 18711        0 24148   22961   2077       1948
10 Australia 1999-01-01      9 18926        0 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <dbl>,
#   assault <dbl>, external <dbl>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

You can also use various “tidy selectors”, like this:

Code

organdata |>
  mutate(across(starts_with("pop"), 
         round))

# A tibble: 238 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <dbl>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065        0 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284        0 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495        0 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667        0 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855        0 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072        0 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311        0 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518        0 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711        0 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926        0 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

The function can be a named one, or you can write something yourself:

Code

organdata |>
  mutate(across(starts_with("pop"), 
         \(x) x / 100))

# A tibble: 238 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <dbl>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA     171.  0.00220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1   173.  0.00223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4   175.  0.00226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5   177.  0.00228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2   179.  0.00231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2   181.  0.00233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6   183.  0.00237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3   185.  0.00239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5   187.  0.00242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67  189.  0.00244 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

You can use where() with select() as well, when you are just subsetting by column but not yet doing anything across the columns:

Code

organdata |>
  select(where(is.character))

# A tibble: 238 × 7
   country   world   opt   consent_law consent_practice consistent ccode
   <chr>     <chr>   <chr> <chr>       <chr>            <chr>      <chr>
 1 Australia Liberal In    Informed    Informed         Yes        Oz   
 2 Australia Liberal In    Informed    Informed         Yes        Oz   
 3 Australia Liberal In    Informed    Informed         Yes        Oz   
 4 Australia Liberal In    Informed    Informed         Yes        Oz   
 5 Australia Liberal In    Informed    Informed         Yes        Oz   
 6 Australia Liberal In    Informed    Informed         Yes        Oz   
 7 Australia Liberal In    Informed    Informed         Yes        Oz   
 8 Australia Liberal In    Informed    Informed         Yes        Oz   
 9 Australia Liberal In    Informed    Informed         Yes        Oz   
10 Australia Liberal In    Informed    Informed         Yes        Oz   
# ℹ 228 more rows

Code

organdata |>
  select(starts_with("gdp"))

# A tibble: 238 × 2
     gdp gdp_lag
   <int>   <int>
 1 16774   16591
 2 17171   16774
 3 17914   17171
 4 18883   17914
 5 19849   18883
 6 21079   19849
 7 21923   21079
 8 22961   21923
 9 24148   22961
10 25445   24148
# ℹ 228 more rows

Code

organdata |>
  select(contains("health"))

# A tibble: 238 × 3
   health health_lag pubhealth
    <dbl>      <dbl>     <dbl>
 1   1300       1224       4.8
 2   1379       1300       5.4
 3   1455       1379       5.4
 4   1540       1455       5.4
 5   1626       1540       5.4
 6   1737       1626       5.5
 7   1846       1737       5.6
 8   1948       1846       5.7
 9   2077       1948       5.9
10   2231       2077       6.1
# ℹ 228 more rows
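Selection helpers can also be combined. The sketch below uses a hypothetical `toy` tibble whose column names imitate those in `organdata`, and shows each helper on its own and then a combination with `&` and `!`.

```r
library(dplyr)

# Made-up data with organdata-like column names (values invented)
toy <- tibble(
  country   = c("X", "Y"),
  gdp       = c(100L, 90L),
  gdp_lag   = c(95L, 88L),
  health    = c(1300, 1400),
  pubhealth = c(4.8, 5.4)
)

chr_cols    <- names(select(toy, where(is.character)))
gdp_cols    <- names(select(toy, starts_with("gdp")))
health_cols <- names(select(toy, contains("health")))

# Numeric columns, except those whose names start with "gdp"
combo_cols  <- names(select(toy, where(is.numeric) & !starts_with("gdp")))

chr_cols
gdp_cols
health_cols
combo_cols
```

Note that contains() matches anywhere in the name, which is why it picks up `pubhealth` as well as `health`.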

Reminder: the `%in%` operator

This is a useful way to restrict a selection to a set of values, whether of columns, with select(), or especially of rows, with filter():

Code

organdata |> 
  filter(country %in% c("Ireland", "Italy", "Spain"))

# A tibble: 42 × 21
   country year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>   <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Italy   NA           NA   56719     18.8 17430   16525   1397       1274
 2 Italy   1991-01-01    5.2 56751     18.8 18209   17430   1520       1397
 3 Italy   1992-01-01    5.8 56859     18.9 18883   18209   1584       1520
 4 Italy   1993-01-01    6.2 57049     18.9 19124   18883   1554       1584
 5 Italy   1994-01-01    7.9 57204     19.0 19903   19124   1557       1554
 6 Italy   1995-01-01   10.1 57301     19.0 20652   19903   1524       1557
 7 Italy   1996-01-01   11   57397     19.0 21396   20652   1605       1524
 8 Italy   1997-01-01   11.6 57512     19.1 22030   21396   1705       1605
 9 Italy   1998-01-01   12.3 57588     19.1 23291   22030   1800       1705
10 Italy   1999-01-01   13.7 57646     19.1 23729   23291   1853       1800
# ℹ 32 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
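As a quick sketch of what `%in%` does on its own, the example below uses a made-up `toy` table with invented values. The operator returns one TRUE or FALSE per element, and filter() keeps the rows where the result is TRUE.

```r
library(dplyr)

# Made-up data (values invented for illustration)
toy <- tibble(
  country = c("Ireland", "Italy", "Spain", "France", "Norway"),
  donors  = c(20, 11, 28, 17, 15)
)

keep <- c("Ireland", "Italy", "Spain")

# One logical value per row: is this country in the set?
toy$country %in% keep

kept    <- toy |> filter(country %in% keep)
dropped <- toy |> filter(!country %in% keep)   # negate to exclude the set instead

kept
dropped
```

This reads more cleanly than chaining several `==` comparisons with `|`, and the set of values can be stored in a variable and reused.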

All this applies to `mutate()` as well

If you use a function like mean() or sd() or n() with mutate() instead of summarize() it will work too. The difference is that a column will be added with the value repeated for all group members. This can be useful when you want e.g. to make a denominator for some calculation later. Remember, mutate() adds or changes columns but never changes the number of rows in the table, whereas summarize() will usually output a table with fewer rows than the one you give it.

Code

## Country-year data, 238 rows altogether,
## with yearly data for 17 countries.
organdata |>
  select(country, donors)

# A tibble: 238 × 2
   country   donors
   <chr>      <dbl>
 1 Australia  NA   
 2 Australia  12.1 
 3 Australia  12.4 
 4 Australia  12.5 
 5 Australia  10.2 
 6 Australia  10.2 
 7 Australia  10.6 
 8 Australia  10.3 
 9 Australia  10.5 
10 Australia   8.67
# ℹ 228 more rows

Code

## Summarize gets you one row per country
organdata |>
  select(country, donors) |> 
  group_by(country) |> 
  summarize(donors_mean = mean(donors, na.rm = TRUE))

# A tibble: 17 × 2
   country        donors_mean
   <chr>                <dbl>
 1 Australia             10.6
 2 Austria               23.5
 3 Belgium               21.9
 4 Canada                14.0
 5 Denmark               13.1
 6 Finland               18.4
 7 France                16.8
 8 Germany               13.0
 9 Ireland               19.8
10 Italy                 11.1
11 Netherlands           13.7
12 Norway                15.4
13 Spain                 28.1
14 Sweden                13.1
15 Switzerland           14.2
16 United Kingdom        13.5
17 United States         20.0

Code

## Mutate adds each country's donor mean 
## to the 238 observations
tmp <- organdata |>
  select(country, donors) |> 
  group_by(country) |> 
  mutate(donors_mean = mean(donors, na.rm = TRUE))

# First few rows of 238
head(tmp)

# A tibble: 6 × 3
# Groups:   country [1]
  country   donors donors_mean
  <chr>      <dbl>       <dbl>
1 Australia   NA          10.6
2 Australia   12.1        10.6
3 Australia   12.4        10.6
4 Australia   12.5        10.6
5 Australia   10.2        10.6
6 Australia   10.2        10.6

Code

# Last few rows of 238
tail(tmp)

# A tibble: 6 × 3
# Groups:   country [1]
  country       donors donors_mean
  <chr>          <dbl>       <dbl>
1 United States   21          20.0
2 United States   20.9        20.0
3 United States   21.2        20.0
4 United States   21.3        20.0
5 United States   21.5        20.0
6 United States   NA          20.0
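The denominator idea mentioned above can be sketched as follows. This uses a hypothetical `toy` table with invented values: the group mean and group size are attached to every row, and each row is then compared with its own group's mean.

```r
library(dplyr)

# Made-up country-year style data (values invented for illustration)
toy <- tibble(
  country = c("X", "X", "X", "Y", "Y"),
  donors  = c(10, 12, 14, 20, 30)
)

out <- toy |>
  group_by(country) |>
  mutate(donors_mean = mean(donors, na.rm = TRUE),  # repeated for every row in the group
         n_obs       = n(),                         # group size, also repeated
         diff        = donors - donors_mean) |>     # each row's deviation from its group mean
  ungroup()

out
```

The table has the same number of rows going in and coming out; only the columns change.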

Graph your summarized tables

Code

gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  drop_na() |> 
  ggplot(mapping = aes(x = pct, 
                       y = reorder(religion, -pct), fill = religion)) + 
  geom_col() + 
    labs(x = "Percent", y = NULL) +
    guides(fill = "none") + 
    facet_wrap(~ bigregion, nrow = 1)

Code

rel_by_region <- gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  drop_na()


head(rel_by_region)

# A tibble: 6 × 4
# Groups:   bigregion [2]
  bigregion religion       n   pct
  <fct>     <fct>      <int> <dbl>
1 Northeast Protestant   158  32.4
2 Northeast Catholic     162  33.2
3 Northeast Jewish        27   5.5
4 Northeast None         112  23  
5 Northeast Other         28   5.7
6 Midwest   Protestant   325  46.8

Code

p <- ggplot(data = rel_by_region, 
                mapping = aes(x = bigregion, 
                              y = pct, 
                              fill = religion))
p_out <- p + geom_col(position = "dodge") +
    labs(x = "Region",
         y = "Percent", 
         fill = "Religion") 

p_out

Experiment with facets:

Code

p <- ggplot(data = rel_by_region, 
                mapping = aes(x = pct, 
                              y = reorder(religion, -pct), 
                              fill = religion))
p_out_facet <- p + geom_col() +
  guides(fill = "none") + 
  facet_wrap(~ bigregion, nrow = 1) +
  labs(x = "Percent",
       y = NULL) 

p_out_facet

Multi-way facets

Code

p <-  ggplot(data = gss_sm,
             mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ race)

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 18 rows containing missing values (`geom_point()`).

Code

p <-  ggplot(data = gss_sm,
             mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ sex + race)

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 18 rows containing missing values (`geom_point()`).

Code

p <-  ggplot(data = gss_sm,
             mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ sex + race, nrow = 1)

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 18 rows containing missing values (`geom_point()`).

`facet_wrap()` vs `facet_grid()`

Code

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_grid(sex ~ race)

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 18 rows containing missing values (`geom_point()`).

Code

p_out <- p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_grid(bigregion ~ race + sex) 

p_out

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 18 rows containing missing values (`geom_point()`).
