class: center middle main-title section-title-1 # .kjh-yellow[Show the]<br /> .kjh-lblue[Right Numbers] <br /> .kjh-yellow[with `dplyr`] .class-info[ **Week 05** .light[Kieran Healy<br> Duke University, Spring 2023] ] --- layout: true class: title title-1 --- # Load our libraries .SMALL[ ```r library(here) # manage file paths library(socviz) # data and some useful functions library(tidyverse) # your friend and mine ``` ] --- # Tidyverse components, again .pull-left.w45[ - .kjh-green[**`library`**]`(tidyverse)` - `Loading tidyverse: ggplot2` - `Loading tidyverse: tibble` - `Loading tidyverse: tidyr` - `Loading tidyverse: readr` - `Loading tidyverse: purrr` - `Loading tidyverse: dplyr` ] -- .pull-right.w55[ - Call the package and ... - `<|` **Draw graphs** - `<|` **Nicer data tables** - `<|` **Tidy your data** - `<|` **Get data into R** - `<|` **Fancy Iteration** - `<|` **Action verbs for tables** ] --- # Other tidyverse components .top[.pull-left.w15[ - `forcats` - `haven` - `lubridate` - `readxl` - `stringr` - `reprex` ]] -- .top[.pull-right.w85[ - `<|` **Deal with factors** - `<|` **Import Stata, SPSS, etc** - `<|` **Dates, Durations, Times** - `<|` **Import from spreadsheets** - `<|` **Strings and Regular Expressions** - `<|` **Make reproducible examples** ]] -- .left.bottom[.footnote[Not all of these are attached when we do `library(tidyverse)`]] --- layout: false class: main-title main-title-inv center middle .center[] --- class: main-title main-title-inv center middle .center[] --- class: main-title main-title-inv center middle .center[] --- class: main-title main-title-inv center middle .center[] --- layout: true class: title title-1 --- class: center middle main-title section-title-1 # .huge[.kjh-yellow[Feeding data]<br /> .kjh-lblue[to `ggplot`]] --- layout: false class: center middle ## .middle.huge.squish4[.kjh-orange[Transform and summarize first.]<br />.kjh-lblue[Then send your clean tables to ggplot.]] --- class: right bottom main-title section-title-1 
## .huge.right.bottom.squish4.kjh-yellow[Crosstabulation<br />.kjh-lblue[and beyond]] --- layout: true class: title title-1 --- # U.S. General Social Survey data: .kjh-pink[`gss_sm`] ```r gss_sm ``` ``` ## # A tibble: 2,867 × 32 ## year id ballot age childs sibs degree race sex region incom…¹ relig ## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None ## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None ## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath… ## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath… ## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None ## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None ## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None ## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath… ## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot… ## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None ## # … with 2,857 more rows, 20 more variables: marital <fct>, padeg <fct>, ## # madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, ## # grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>, ## # income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, ## # religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>, and ## # abbreviated variable name ¹income16 ``` - We often want summary tables or graphs of data like this. 
---

# Two-way tables: Row percents

|bigregion |Protestant |Catholic |Jewish |None |Other |Total |
|:---------|:----------|:--------|:------|:----|:-----|:-----|
|Northeast |32.4       |33.3     |5.5    |23.0 |5.7   |100.0 |
|Midwest   |47.1       |24.9     |0.4    |22.8 |4.8   |100.0 |
|South     |62.4       |15.4     |1.1    |16.3 |4.8   |100.0 |
|West      |37.7       |24.6     |1.6    |28.5 |7.6   |100.0 |

---

# Two-way tables: Column percents

|bigregion |Protestant |Catholic |Jewish |None  |Other |
|:---------|:----------|:--------|:------|:-----|:-----|
|Northeast |11.5       |25.0     |52.9   |18.1  |17.6  |
|Midwest   |23.7       |26.5     |5.9    |25.4  |20.8  |
|South     |47.4       |24.7     |21.6   |27.5  |31.4  |
|West      |17.4       |23.9     |19.6   |29.1  |30.2  |
|Total     |100.0      |100.0    |100.0  |100.0 |100.0 |

---

# Two-way tables: Full marginals

|bigregion |Protestant |Catholic |Jewish |None |Other |
|:---------|:----------|:--------|:------|:----|:-----|
|Northeast |5.5        |5.7      |0.9    |3.9  |1.0   |
|Midwest   |11.4       |6.0      |0.1    |5.5  |1.2   |
|South     |22.8       |5.6      |0.4    |6.0  |1.8   |
|West      |8.4        |5.4      |0.4    |6.3  |1.7   |

---

# .kjh-yellow[dplyr] lets you work with tibbles

.pull-left-wide[
- Remember, tibbles are tables of data where the columns can be of different types, such as numeric, logical, character, factor, etc.
]

--

.pull-left-wide[
- We'll use dplyr to _transform_ and _summarize_ our data.
]

--

.pull-left-wide[
- We'll use the pipe operator, .kjh-pink[**`|>`**], to chain together sequences of actions on our tables.
]

---
layout: false
class: center

# .huge.middle.squish4[`dplyr` draws on the logic and language of .kjh-green[database queries], where the focus is on manipulating tables]

---
layout: true
class: title title-1

---

# Some .kjh-orange[actions] to take on a single table

.pull-left.w80[
- .kjh-orange[**Group**] the data at the level we want, such as “_Religion by Region_” or “_Children by School_”.

- .kjh-orange[**Subset**] either the rows or columns of our table.

- .kjh-orange[**Mutate**] the data. That is, change something at the _current_ level of grouping.
Mutating adds new columns to the table, or changes the content of an existing column. It never changes the number of rows. - .kjh-orange[**Summarize**] or aggregate the data. That is, make something new at a _higher_ level of grouping. E.g., calculate means or counts by some grouping variable. This will generally result in a smaller, _summary_ table. ] --- # Each .kjh-orange[action] is implemented by a .kjh-green[function] -- .pull-left-wide[ - **Group** using .kjh-green[**`group_by()`**]. ] -- .pull-left-wide[ - **Subset** has one action for rows and one for columns. We .kjh-green[**`filter()`**] rows and .kjh-green[**`select()`**] columns. ] -- .pull-left-wide[ - **Mutate** tables (i.e. add new columns, or re-make existing ones) using .kjh-green[**`mutate()`**]. ] -- .pull-left-wide[ - **Summarize** tables (i.e. perform aggregating calculations) using .kjh-green[**`summarize()`**]. ] --- class: center middle main-title section-title-1 # .huge[.kjh-lblue[Example:]<br/>.kjh-yellow[The GSS]] --- # U.S. 
General Social Survey data: .kjh-pink[`gss_sm`] ```r gss_sm ``` ``` ## # A tibble: 2,867 × 32 ## year id ballot age childs sibs degree race sex region incom…¹ relig ## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None ## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None ## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath… ## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath… ## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None ## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None ## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None ## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath… ## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot… ## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None ## # … with 2,857 more rows, 20 more variables: marital <fct>, padeg <fct>, ## # madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, ## # grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>, ## # income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, ## # religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>, and ## # abbreviated variable name ¹income16 ``` Notice again how the tibble already tells us a lot. --- # Summarizing a Table - Here's what we're going to do: .center[] --- # Summarizing a Table - We're just taking a look at the relevant columns here. We don't need to narrow it like this to do our summary, though. 
```r gss_sm |> select(id, bigregion, religion) ``` ``` ## # A tibble: 2,867 × 3 ## id bigregion religion ## <dbl> <fct> <fct> ## 1 1 Northeast None ## 2 2 Northeast None ## 3 3 Northeast Catholic ## 4 4 Northeast Catholic ## 5 5 Northeast None ## 6 6 Northeast None ## 7 7 Northeast None ## 8 8 Northeast Catholic ## 9 9 Northeast Protestant ## 10 10 Northeast None ## # … with 2,857 more rows ``` --- count: false # Count up by one column or variable .panel1-reveal-onetablevar-auto[ ```r *gss_sm ``` ] .panel2-reveal-onetablevar-auto[ ``` ## # A tibble: 2,867 × 32 ## year id ballot age childs sibs degree race sex region incom…¹ relig ## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None ## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None ## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath… ## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath… ## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None ## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None ## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None ## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath… ## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot… ## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None ## # … with 2,857 more rows, 20 more variables: marital <fct>, padeg <fct>, ## # madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, ## # grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>, ## # income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, ## # religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>, and ## # abbreviated variable name ¹income16 ``` ] --- count: false # Count up by one column or variable .panel1-reveal-onetablevar-auto[ ```r gss_sm %>% * group_by(bigregion) ``` ] .panel2-reveal-onetablevar-auto[ ``` ## # A tibble: 2,867 × 32 ## # Groups: bigregion [4] ## year id ballot age childs sibs degree race 
sex region incom…¹ relig ## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None ## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None ## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath… ## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath… ## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None ## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None ## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None ## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath… ## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot… ## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None ## # … with 2,857 more rows, 20 more variables: marital <fct>, padeg <fct>, ## # madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, ## # grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>, ## # income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, ## # religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>, and ## # abbreviated variable name ¹income16 ``` ] --- count: false # Count up by one column or variable .panel1-reveal-onetablevar-auto[ ```r gss_sm %>% group_by(bigregion) %>% * summarize(total = n()) ``` ] .panel2-reveal-onetablevar-auto[ ``` ## # A tibble: 4 × 2 ## bigregion total ## <fct> <int> ## 1 Northeast 488 ## 2 Midwest 695 ## 3 South 1052 ## 4 West 632 ``` ] <style> .panel1-reveal-onetablevar-auto { color: black; width: 34.3%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-reveal-onetablevar-auto { color: black; width: 63.7%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-reveal-onetablevar-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> Grouping changes the _logical_ structure of the tibble. It tells you what subsets of rows the next mutate or summary operation will be carried out within. 
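---

# Grouping on a toy table

A minimal sketch of the point about logical structure, assuming a made-up five-row tibble called `toy` (hypothetical values, not GSS data). Grouping leaves the rows and columns alone; it only changes what the next verb does.

```r
library(dplyr)

# Toy data: hypothetical values for illustration only
toy <- tibble(
  region = c("North", "North", "North", "South", "South"),
  score  = c(1, 2, 3, 10, 20)
)

grouped <- toy |>
  group_by(region)

# Same five rows as before; only the grouping metadata has changed
nrow(grouped)
group_vars(grouped)

# mutate() works within each group and keeps every row
with_means <- grouped |>
  mutate(region_mean = mean(score))

# summarize() collapses each group to a single row
region_means <- grouped |>
  summarize(region_mean = mean(score))
```

Here `with_means` should still have five rows, with the group mean repeated on each row of its region, while `region_means` should have one row per region.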
--- count: false # Summarize by region and religion .panel1-reveal-pipe1-auto[ ```r *gss_sm ``` ] .panel2-reveal-pipe1-auto[ ``` ## # A tibble: 2,867 × 32 ## year id ballot age childs sibs degree race sex region incom…¹ relig ## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None ## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None ## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath… ## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath… ## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None ## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None ## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None ## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath… ## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot… ## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None ## # … with 2,857 more rows, 20 more variables: marital <fct>, padeg <fct>, ## # madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, ## # grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>, ## # income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, ## # religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>, and ## # abbreviated variable name ¹income16 ``` ] --- count: false # Summarize by region and religion .panel1-reveal-pipe1-auto[ ```r gss_sm %>% * group_by(bigregion, religion) ``` ] .panel2-reveal-pipe1-auto[ ``` ## # A tibble: 2,867 × 32 ## # Groups: bigregion, religion [24] ## year id ballot age childs sibs degree race sex region incom…¹ relig ## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None ## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None ## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath… ## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath… ## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None ## 6 
2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None ## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None ## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath… ## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot… ## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None ## # … with 2,857 more rows, 20 more variables: marital <fct>, padeg <fct>, ## # madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, ## # grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>, ## # income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, ## # religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>, and ## # abbreviated variable name ¹income16 ``` ] --- count: false # Summarize by region and religion .panel1-reveal-pipe1-auto[ ```r gss_sm %>% group_by(bigregion, religion) %>% * summarize(total = n()) ``` ] .panel2-reveal-pipe1-auto[ ``` ## # A tibble: 24 × 3 ## # Groups: bigregion [4] ## bigregion religion total ## <fct> <fct> <int> ## 1 Northeast Protestant 158 ## 2 Northeast Catholic 162 ## 3 Northeast Jewish 27 ## 4 Northeast None 112 ## 5 Northeast Other 28 ## 6 Northeast <NA> 1 ## 7 Midwest Protestant 325 ## 8 Midwest Catholic 172 ## 9 Midwest Jewish 3 ## 10 Midwest None 157 ## # … with 14 more rows ``` ] --- count: false # Summarize by region and religion .panel1-reveal-pipe1-auto[ ```r gss_sm %>% group_by(bigregion, religion) %>% summarize(total = n()) %>% * mutate(freq = total / sum(total), * pct = round((freq*100), 1)) ``` ] .panel2-reveal-pipe1-auto[ ``` ## # A tibble: 24 × 5 ## # Groups: bigregion [4] ## bigregion religion total freq pct ## <fct> <fct> <int> <dbl> <dbl> ## 1 Northeast Protestant 158 0.324 32.4 ## 2 Northeast Catholic 162 0.332 33.2 ## 3 Northeast Jewish 27 0.0553 5.5 ## 4 Northeast None 112 0.230 23 ## 5 Northeast Other 28 0.0574 5.7 ## 6 Northeast <NA> 1 0.00205 0.2 ## 7 Midwest Protestant 325 0.468 46.8 ## 8 Midwest Catholic 172 0.247 24.7 ## 9 Midwest Jewish 3 0.00432 0.4 ## 10 
Midwest None 157 0.226 22.6 ## # … with 14 more rows ``` ] --- count: false # Summarize by region and religion .panel1-reveal-pipe1-auto[ ```r gss_sm %>% group_by(bigregion, religion) %>% summarize(total = n()) %>% mutate(freq = total / sum(total), pct = round((freq*100), 1)) ``` ] .panel2-reveal-pipe1-auto[ ``` ## # A tibble: 24 × 5 ## # Groups: bigregion [4] ## bigregion religion total freq pct ## <fct> <fct> <int> <dbl> <dbl> ## 1 Northeast Protestant 158 0.324 32.4 ## 2 Northeast Catholic 162 0.332 33.2 ## 3 Northeast Jewish 27 0.0553 5.5 ## 4 Northeast None 112 0.230 23 ## 5 Northeast Other 28 0.0574 5.7 ## 6 Northeast <NA> 1 0.00205 0.2 ## 7 Midwest Protestant 325 0.468 46.8 ## 8 Midwest Catholic 172 0.247 24.7 ## 9 Midwest Jewish 3 0.00432 0.4 ## 10 Midwest None 157 0.226 22.6 ## # … with 14 more rows ``` ] <style> .panel1-reveal-pipe1-auto { color: black; width: 34.3%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-reveal-pipe1-auto { color: black; width: 63.7%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-reveal-pipe1-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- # Pipelines carry assumptions forward .small[ ```r gss_sm |> * group_by(bigregion, religion) |> summarize(total = n()) |> mutate(freq = total / sum(total), pct = round((freq*100), 1)) ``` ``` ## # A tibble: 24 × 5 ## # Groups: bigregion [4] ## bigregion religion total freq pct ## <fct> <fct> <int> <dbl> <dbl> ## 1 Northeast Protestant 158 0.324 32.4 ## 2 Northeast Catholic 162 0.332 33.2 ## 3 Northeast Jewish 27 0.0553 5.5 ## 4 Northeast None 112 0.230 23 ## 5 Northeast Other 28 0.0574 5.7 ## 6 Northeast <NA> 1 0.00205 0.2 ## 7 Midwest Protestant 325 0.468 46.8 ## 8 Midwest Catholic 172 0.247 24.7 ## 9 Midwest Jewish 3 0.00432 0.4 ## 10 Midwest None 157 0.226 22.6 ## # … with 14 more rows ``` ] Groups are carried forward till summarized over, or explicitly ungrouped with .kjh-green[`ungroup()`]. 
-- Summary calculations are done on the innermost group, which then "disappears". (Notice how .kjh-orange[`religion`] is no longer a group in the output, but .kjh-orange[`bigregion`] is.) --- # Pipelines carry assumptions forward .small[ ```r gss_sm |> group_by(bigregion, religion) |> summarize(total = n()) |> mutate(freq = total / sum(total), * pct = round((freq*100), 1)) ``` ``` ## # A tibble: 24 × 5 ## # Groups: bigregion [4] ## bigregion religion total freq pct ## <fct> <fct> <int> <dbl> <dbl> ## 1 Northeast Protestant 158 0.324 32.4 ## 2 Northeast Catholic 162 0.332 33.2 ## 3 Northeast Jewish 27 0.0553 5.5 ## 4 Northeast None 112 0.230 23 ## 5 Northeast Other 28 0.0574 5.7 ## 6 Northeast <NA> 1 0.00205 0.2 ## 7 Midwest Protestant 325 0.468 46.8 ## 8 Midwest Catholic 172 0.247 24.7 ## 9 Midwest Jewish 3 0.00432 0.4 ## 10 Midwest None 157 0.226 22.6 ## # … with 14 more rows ``` ] .kjh-green[**`mutate()`**] is quite clever. See how we can immediately use .kjh-orange[**`freq`**] to calculate .kjh-orange[**`pct`**], even though we are creating them both in the same .kjh-green[**`mutate()`**] expression. --- # Convenience functions .small[ ```r gss_sm |> * group_by(bigregion, religion) |> * summarize(total = n()) |> mutate(freq = total / sum(total), pct = round((freq*100), 1)) ``` ``` ## # A tibble: 24 × 5 ## # Groups: bigregion [4] ## bigregion religion total freq pct ## <fct> <fct> <int> <dbl> <dbl> ## 1 Northeast Protestant 158 0.324 32.4 ## 2 Northeast Catholic 162 0.332 33.2 ## 3 Northeast Jewish 27 0.0553 5.5 ## 4 Northeast None 112 0.230 23 ## 5 Northeast Other 28 0.0574 5.7 ## 6 Northeast <NA> 1 0.00205 0.2 ## 7 Midwest Protestant 325 0.468 46.8 ## 8 Midwest Catholic 172 0.247 24.7 ## 9 Midwest Jewish 3 0.00432 0.4 ## 10 Midwest None 157 0.226 22.6 ## # … with 14 more rows ``` ] We're going to be doing this .kjh-green[**`group_by()`**] ... .kjh-green[**`n()`**] step a lot. Some shorthand for it would be useful. 
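---

# Checking which groups remain

Before reaching for shorthand, it can help to confirm which groups survive a summary. This is a sketch on a made-up six-row tibble called `toy` (hypothetical values, not GSS data), using `group_vars()` to inspect the grouping.

```r
library(dplyr)

# Toy data: hypothetical values for illustration only
toy <- tibble(
  region   = c("North", "North", "North", "South", "South", "South"),
  religion = c("A", "A", "B", "A", "B", "B")
)

counts <- toy |>
  group_by(region, religion) |>
  summarize(total = n())

# religion has been summarized over; region is still a group
group_vars(counts)

# So sum(total) is computed within each region
within_region <- counts |>
  mutate(freq = total / sum(total))

# After ungroup(), the denominator is the whole table
overall <- counts |>
  ungroup() |>
  mutate(freq = total / sum(total))
```

In `within_region` the frequencies should sum to 1 inside each region; in `overall` they should sum to 1 across the whole table. Passing `.groups = "drop"` to `summarize()` is another way to get an ungrouped result.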
--- # .kjh-pink[Three options] for counting up rows .pull-left-3[ - .SMALL.squish3[Do it yourself with .kjh-green[**`n()`**]] .SMALL[ ```r gss_sm |> * group_by(bigregion, religion) |> * summarize(n = n()) ``` ``` ## # A tibble: 24 × 3 ## # Groups: bigregion [4] ## bigregion religion n ## <fct> <fct> <int> ## 1 Northeast Protestant 158 ## 2 Northeast Catholic 162 ## 3 Northeast Jewish 27 ## 4 Northeast None 112 ## 5 Northeast Other 28 ## 6 Northeast <NA> 1 ## 7 Midwest Protestant 325 ## 8 Midwest Catholic 172 ## 9 Midwest Jewish 3 ## 10 Midwest None 157 ## # … with 14 more rows ``` ] - .small.squish3[_The result is **grouped**._] ] -- .pull-middle-3[ - .SMALL.squish3[Use .kjh-green[**`tally()`**]] .SMALL[ ```r gss_sm |> group_by(bigregion, religion) |> * tally() ``` ``` ## # A tibble: 24 × 3 ## # Groups: bigregion [4] ## bigregion religion n ## <fct> <fct> <int> ## 1 Northeast Protestant 158 ## 2 Northeast Catholic 162 ## 3 Northeast Jewish 27 ## 4 Northeast None 112 ## 5 Northeast Other 28 ## 6 Northeast <NA> 1 ## 7 Midwest Protestant 325 ## 8 Midwest Catholic 172 ## 9 Midwest Jewish 3 ## 10 Midwest None 157 ## # … with 14 more rows ``` ] - .small.squish3[_The result is **grouped**._] ] -- .pull-right-3[ - .SMALL.squish3[use .kjh-green[**`count()`**]] .SMALL[ ```r gss_sm |> * count(bigregion, religion) ``` ``` ## # A tibble: 24 × 3 ## bigregion religion n ## <fct> <fct> <int> ## 1 Northeast Protestant 158 ## 2 Northeast Catholic 162 ## 3 Northeast Jewish 27 ## 4 Northeast None 112 ## 5 Northeast Other 28 ## 6 Northeast <NA> 1 ## 7 Midwest Protestant 325 ## 8 Midwest Catholic 172 ## 9 Midwest Jewish 3 ## 10 Midwest None 157 ## # … with 14 more rows ``` ] - .small.squish3[_One step; the result is **not grouped**._] ] --- # Pipelined tables can be quickly checked .pull-left[ ```r ## Calculate pct religion within region? 
rel_by_region <- gss_sm |> 
  count(bigregion, religion) |> 
  mutate(pct = round((n/sum(n))*100, 1)) 

rel_by_region
```

```
## # A tibble: 24 × 4
##    bigregion religion       n   pct
##    <fct>     <fct>      <int> <dbl>
##  1 Northeast Protestant   158   5.5
##  2 Northeast Catholic     162   5.7
##  3 Northeast Jewish        27   0.9
##  4 Northeast None         112   3.9
##  5 Northeast Other         28   1  
##  6 Northeast <NA>           1   0  
##  7 Midwest   Protestant   325  11.3
##  8 Midwest   Catholic     172   6  
##  9 Midwest   Jewish         3   0.1
## 10 Midwest   None         157   5.5
## # … with 14 more rows
```

Hm, did I sum over the right group?
]

--

.pull-right[

```r
## Each region should sum to ~100
rel_by_region |> 
  group_by(bigregion) |> 
  summarize(total = sum(pct))
```

```
## # A tibble: 4 × 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast  17  
## 2 Midwest    24.3
## 3 South      36.7
## 4 West       22
```

No! What has gone wrong here?
]

---

# Pipelined tables can be quickly checked

.pull-left[

```r
rel_by_region <- gss_sm |> 
*  count(bigregion, religion) |> 
  mutate(pct = round((n/sum(n))*100, 1)) 
```

.SMALL.squish3[.kjh-green[**`count()`**] returns ungrouped results, so there are no groups to carry forward to the .kjh-green[**`mutate()`**] step.]

```r
rel_by_region |> 
  summarize(total = sum(pct))
```

```
## # A tibble: 1 × 1
##   total
##   <dbl>
## 1   100
```

.SMALL.squish3[With .kjh-green[**`count()`**], the .kjh-orange[**`pct`**] values in this case are the marginals for the whole table.]
]

--

.pull-right[

```r
rel_by_region <- gss_sm |> 
*  group_by(bigregion, religion) |> 
*  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) 
```

```r
# Check
rel_by_region |> 
  group_by(bigregion) |> 
  summarize(total = sum(pct))
```

```
## # A tibble: 4 × 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast 100  
## 2 Midwest    99.9
## 3 South     100  
## 4 West      100.
```

.SMALL.squish3[.kjh-green[**`group_by()`**] and .kjh-green[**`tally()`**] both return a grouped result. The totals are not all exactly 100 because each percentage was rounded with .kjh-green[**`round()`**] before we summed them in the check.]
]

---

# Two lessons

## 1: Check your tables!
- Pipelines feed their content forward, so a mistake early on carries through to the final result.

--

- Often, complex tables and graphs can be disturbingly plausible even when wrong.

--

- So, figure out what the result should be and test it!

--

- Starting with simple or toy cases can help with this process.

---

# Two lessons

## 2: Inspect your pipes!

- Understand pipelines by running them forward or peeling them back a step at a time.

- This is a _very_ effective way to understand your own and other people's code.

---

# Pass your pipeline on to a .kjh-yellow[table]

```r
gss_sm |> 
  count(bigregion, religion) |> 
*  pivot_wider(names_from = bigregion, values_from = n) |> 
  knitr::kable()  # kable() comes from knitr, which library(tidyverse) does not attach
```

.small[

|religion   | Northeast| Midwest| South| West|
|:----------|---------:|-------:|-----:|----:|
|Protestant |       158|     325|   650|  238|
|Catholic   |       162|     172|   160|  155|
|Jewish     |        27|       3|    11|   10|
|None       |       112|     157|   170|  180|
|Other      |        28|      33|    50|   48|
|NA         |         1|       5|    11|    1|

]

---

# Pass your pipeline on to a .kjh-yellow[graph]

.SMALL[

```r
gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  drop_na() |> 
*  ggplot(mapping = aes(x = pct, y = reorder(religion, -pct), fill = religion)) + 
*  geom_col() + 
    labs(x = "Percent", y = NULL) +
    guides(fill = "none") + 
    facet_wrap(~ bigregion, nrow = 1)
```

<img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-23-1.png" width="1080" style="display: block; margin: auto;" />
]

---
layout: false
class: center middle

## .middle.huge.squish4[.kjh-orange[Use `dplyr` pipelines to create summary tables.]<br />.kjh-lblue[Then send your clean tables to `ggplot`.]]

---
class: right bottom main-title section-title-1

## .huge.right.bottom.squish4[.kjh-lblue[Facets] .kjh-yellow[are often<br />better than] .kjh-lblue[Guides]]

---
layout: true
class: title title-1

---

# Let's put that table in an object

```r
rel_by_region <- gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  drop_na()

head(rel_by_region) ``` ``` ## # A tibble: 6 × 4 ## # Groups: bigregion [2] ## bigregion religion n pct ## <fct> <fct> <int> <dbl> ## 1 Northeast Protestant 158 32.4 ## 2 Northeast Catholic 162 33.2 ## 3 Northeast Jewish 27 5.5 ## 4 Northeast None 112 23 ## 5 Northeast Other 28 5.7 ## 6 Midwest Protestant 325 46.8 ``` --- # We might write ... ```r p <- ggplot(data = rel_by_region, mapping = aes(x = bigregion, y = pct, fill = religion)) p_out <- p + geom_col(position = "dodge") + labs(x = "Region", y = "Percent", fill = "Religion") ``` --- # We might write ... <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-26-1.png" width="864" style="display: block; margin: auto;" /> --- # Is this an effective graph? .kjh-red[Not really!] <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-27-1.png" width="864" style="display: block; margin: auto;" /> --- # Try .kjh-lblue[faceting] instead ```r p <- ggplot(data = rel_by_region, * mapping = aes(x = pct, * y = reorder(religion, -pct), fill = religion)) p_out_facet <- p + geom_col() + guides(fill = "none") + facet_wrap(~ bigregion, nrow = 1) + labs(x = "Percent", y = NULL) ``` - Putting categories on the y-axis is a very useful trick. - Faceting reduces the number of guides the viewer needs to consult. --- # Try .kjh-lblue[faceting] instead <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-29-1.png" width="1080" style="display: block; margin: auto;" /> ### .kjh-green[Try putting categories on the y-axis. (And reorder them by x.)] ### .kjh-lblue[Try faceting variables instead of mapping them to color or shape.] ### .kjh-pink[Try to minimize the need for guides and legends.] 
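---

# What `reorder()` is doing

A small sketch of the `reorder()` call used in the faceted plot, assuming a made-up data frame called `toy` (hypothetical values, not GSS data). `reorder()` sorts the levels of its first argument by a summary of the second, the mean by default, and that summary is computed over all rows rather than within each facet.

```r
# Toy data: hypothetical percentages for two regions
toy <- data.frame(
  region   = c("A", "A", "A", "B", "B", "B"),
  religion = c("Protestant", "Catholic", "None",
               "Protestant", "Catholic", "None"),
  pct      = c(50, 30, 20, 40, 20, 40)
)

# Order religion by the mean of -pct, so the largest average pct comes first
by_pct <- reorder(toy$religion, -toy$pct)

levels(by_pct)
```

Within region A alone, Catholic (30) is larger than None (20), but averaged over both regions None comes out ahead, so the ordering is shared by every panel. That shared ordering is what keeps a row of facets easy to compare.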
---
class: right bottom main-title section-title-1

## .huge.right.bottom.squish4[.kjh-yellow[Two kinds of] .kjh-lblue[facet]]

---

# Facet Children vs Age, by Race

```r
p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ race)
```

<img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-30-1.png" width="720" style="display: block; margin: auto;" />

---

# We can facet by more than one variable

```r
p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
*  facet_wrap(~ sex + race)
```

<img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-31-1.png" width="576" style="display: block; margin: auto;" />

---

# We can arrange .kjh-green[facet_wrap()] quite freely

```r
p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
*  facet_wrap(~ sex + race, nrow = 1)
```

<img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-32-1.png" width="1080" style="display: block; margin: auto;" />

---

# .kjh-green[facet_grid()] is more like a true crosstab

```r
p + geom_point(alpha = 0.2) + 
  geom_smooth() +
*  facet_grid(sex ~ race)
```

<img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-33-1.png" width="792" style="display: block; margin: auto;" />

---

# Both can be extended to multi-way views

```r
p_out <- p + geom_point(alpha = 0.2) + 
  geom_smooth() +
*  facet_grid(bigregion ~ race + sex)
```

---
layout: false

<img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-35-1.png" width="864" style="display: block; margin: auto;" />

---
class: center middle main-title section-title-1

# .huge[.kjh-lblue[What we've] .kjh-yellow[built up]]

---
layout: true
class: title title-1

---

# Core Grammar

.center[]

---

# Grouped data; faceting

- Along with a few peeks at scale transformations, guide adjustments, and theme adjustments

.center[]

---

# .kjh-lblue[`dplyr`] and 
Pipelining ### The elements of filtering and summarizing ```r gss_sm |> group_by(bigregion, religion) |> tally() |> mutate(freq = n / sum(n), pct = round((freq*100), 1)) ``` ``` ## # A tibble: 24 × 5 ## # Groups: bigregion [4] ## bigregion religion n freq pct ## <fct> <fct> <int> <dbl> <dbl> ## 1 Northeast Protestant 158 0.324 32.4 ## 2 Northeast Catholic 162 0.332 33.2 ## 3 Northeast Jewish 27 0.0553 5.5 ## 4 Northeast None 112 0.230 23 ## 5 Northeast Other 28 0.0574 5.7 ## 6 Northeast <NA> 1 0.00205 0.2 ## 7 Midwest Protestant 325 0.468 46.8 ## 8 Midwest Catholic 172 0.247 24.7 ## 9 Midwest Jewish 3 0.00432 0.4 ## 10 Midwest None 157 0.226 22.6 ## # … with 14 more rows ``` --- class: right bottom main-title section-title-1 ## .huge.right.bottom.squish4[.kjh-yellow[Some data on]<br />.kjh-lblue[Organ Donation]] --- # .kjh-pink[`organdata`] is in the .kjh-lblue[`socviz`] package ```r organdata ``` ``` ## # A tibble: 238 × 21 ## country year donors pop pop_d…¹ gdp gdp_lag health healt…² pubhe…³ ## <chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl> <dbl> ## 1 Austral… NA NA 17065 0.220 16774 16591 1300 1224 4.8 ## 2 Austral… 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300 5.4 ## 3 Austral… 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379 5.4 ## 4 Austral… 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455 5.4 ## 5 Austral… 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540 5.4 ## 6 Austral… 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626 5.5 ## 7 Austral… 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737 5.6 ## 8 Austral… 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846 5.7 ## 9 Austral… 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948 5.9 ## 10 Austral… 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077 6.1 ## # … with 228 more rows, 11 more variables: roads <dbl>, cerebvas <int>, ## # assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>, ## # consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>, ## # and abbreviated 
variable names ¹pop_dens, ²health_lag, ³pubhealth ``` --- # First looks ```r p <- ggplot(data = organdata, mapping = aes(x = year, y = donors)) p + geom_point() ``` <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-38-1.png" width="720" style="display: block; margin: auto;" /> --- # First looks ```r p <- ggplot(data = organdata, mapping = aes(x = year, y = donors)) p + geom_line() ``` <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-39-1.png" width="720" style="display: block; margin: auto;" /> --- # First looks ```r p <- ggplot(data = organdata, mapping = aes(x = year, y = donors)) p + geom_line(aes(group = country)) ``` <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-40-1.png" width="720" style="display: block; margin: auto;" /> --- # First looks ```r p <- ggplot(data = organdata, mapping = aes(x = year, y = donors)) p + geom_line() + facet_wrap(~ country, nrow = 3) ``` <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-41a-1.png" width="1512" style="display: block; margin: auto;" /> --- # First looks ```r p <- ggplot(data = organdata, mapping = aes(x = year, y = donors)) p + geom_line() + facet_wrap(~ reorder(country, donors, na.rm = TRUE), nrow = 3) ``` <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-41b-1.png" width="1512" style="display: block; margin: auto;" /> --- # First looks ```r p <- ggplot(data = organdata, mapping = aes(x = year, y = donors)) p + geom_line() + facet_wrap(~ reorder(country, -donors, na.rm = TRUE), nrow = 3) ``` <img src="05-slides_files/figure-html/05-work-with-dplyr-and-geoms-41c-1.png" width="1512" style="display: block; margin: auto;" /> --- class: right bottom main-title section-title-1 ## .huge.right.bottom.squish4[.kjh-yellow[Summarize better]<br /> .kjh-lblue[with **`dplyr`**]] --- # Summarize a bunch of variables ```r by_country <- organdata |> group_by(consent_law, country) |> summarize(donors_mean= mean(donors, na.rm = TRUE), donors_sd = 
              sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE),
            cerebvas_mean = mean(cerebvas, na.rm = TRUE))
head(by_country)
```

```
## # A tibble: 6 × 8
## # Groups:   consent_law [1]
##   consent_law country     donors_mean donors_sd gdp_mean healt…¹ roads…² cereb…³
##   <chr>       <chr>             <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
## 1 Informed    Australia          10.6      1.14   22179.   1958.    105.    558.
## 2 Informed    Canada             14.0     0.751   23711.   2272.    109.    422.
## 3 Informed    Denmark            13.1      1.47   23722.   2054.    102.    641.
## 4 Informed    Germany            13.0     0.611   22163.   2349.    113.    707.
## 5 Informed    Ireland            19.8      2.48   20824.   1480.    118.    705.
## 6 Informed    Netherlands        13.7      1.55   23013.   1993.    76.1    585.
## # … with abbreviated variable names ¹health_mean, ²roads_mean, ³cerebvas_mean
```

- .medium[This works, but there's so much repetition! It's an open invitation to make mistakes copying and pasting.]

---
layout: false
class: main-title main-title-inv

# .middle.squish4.huge[.kjh-lblue[DRY:]] <br /> .middle.squish4.large.kjh-orange[Don't Repeat Yourself]

---
layout: true
class: title title-1

---

# Use .kjh-green[`across()`] and .kjh-green[`where()`] instead

```r
by_country <- organdata |>
  group_by(consent_law, country) |>
* summarize(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))))
head(by_country)
```

```
## # A tibble: 6 × 28
## # Groups:   consent_law [1]
##   consen…¹ country donor…² donor…³ pop_m…⁴ pop_sd pop_d…⁵ pop_d…⁶ gdp_m…⁷ gdp_sd
##   <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 Informed Austra…    10.6    1.14  18318.   831.   0.237  0.0107  22179.  3959.
## 2 Informed Canada     14.0   0.751  29608.  1193.   0.297  0.0120  23711.  3966.
## 3 Informed Denmark    13.1    1.47   5257.   80.6    12.2   0.187  23722.  3896.
## 4 Informed Germany    13.0   0.611  80255.  5158.    22.5    1.44  22163.  2501.
## 5 Informed Ireland    19.8    2.48   3674.   132.    5.23   0.187  20824.  6670.
## 6 Informed Nether…    13.7    1.55  15548.   373.    37.4   0.898  23013.  3770.
## # … with 18 more variables: gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
## #   health_mean <dbl>, health_sd <dbl>, health_lag_mean <dbl>,
## #   health_lag_sd <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>,
## #   roads_mean <dbl>, roads_sd <dbl>, cerebvas_mean <dbl>, cerebvas_sd <dbl>,
## #   assault_mean <dbl>, assault_sd <dbl>, external_mean <dbl>,
## #   external_sd <dbl>, txp_pop_mean <dbl>, txp_pop_sd <dbl>, and abbreviated
## #   variable names ¹consent_law, ²donors_mean, ³donors_sd, ⁴pop_mean, …
```

---

# Use .kjh-green[`across()`] and .kjh-green[`where()`] instead

```r
by_country <- organdata |>
  group_by(consent_law, country) |>
* summarize(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))),
*           .groups = "drop")
head(by_country)
```

```
## # A tibble: 6 × 28
##   consen…¹ country donor…² donor…³ pop_m…⁴ pop_sd pop_d…⁵ pop_d…⁶ gdp_m…⁷ gdp_sd
##   <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 Informed Austra…    10.6    1.14  18318.   831.   0.237  0.0107  22179.  3959.
## 2 Informed Canada     14.0   0.751  29608.  1193.   0.297  0.0120  23711.  3966.
## 3 Informed Denmark    13.1    1.47   5257.   80.6    12.2   0.187  23722.  3896.
## 4 Informed Germany    13.0   0.611  80255.  5158.    22.5    1.44  22163.  2501.
## 5 Informed Ireland    19.8    2.48   3674.   132.    5.23   0.187  20824.  6670.
## 6 Informed Nether…    13.7    1.55  15548.   373.    37.4   0.898  23013.  3770.
## # … with 18 more variables: gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
## #   health_mean <dbl>, health_sd <dbl>, health_lag_mean <dbl>,
## #   health_lag_sd <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>,
## #   roads_mean <dbl>, roads_sd <dbl>, cerebvas_mean <dbl>, cerebvas_sd <dbl>,
## #   assault_mean <dbl>, assault_sd <dbl>, external_mean <dbl>,
## #   external_sd <dbl>, txp_pop_mean <dbl>, txp_pop_sd <dbl>, and abbreviated
## #   variable names ¹consent_law, ²donors_mean, ³donors_sd, ⁴pop_mean, …
```

---

# Plot our summary data

.pull-left.w45[
```r
by_country |>
  ggplot(mapping = aes(x = donors_mean,
                       y = reorder(country, donors_mean),
                       color = consent_law)) +
  geom_point(size = 3) +
  labs(x = "Donor Procurement Rate",
       y = NULL,
       color = "Consent Law")
```
]

--

.pull-right.w55[
<img src="05-slides_files/figure-html/codefig-consent1-1.png" width="768" style="display: block; margin: auto;" />
]

---

# What about faceting it instead?

.pull-left.w45[
```r
by_country |>
  ggplot(mapping = aes(x = donors_mean,
                       y = reorder(country, donors_mean),
                       color = consent_law)) +
  geom_point(size = 3) +
  guides(color = "none") +
* facet_wrap(~ consent_law) +
  labs(x = "Donor Procurement Rate",
       y = NULL,
       color = "Consent Law")
```

.pull-left.w80[The problem is that each country belongs to only one Consent Law category, but every panel still lists all the countries on its y-axis.]
]

--

.pull-right.w55[
<img src="05-slides_files/figure-html/codefig-consent2-1.png" width="768" style="display: block; margin: auto;" />
]

---

# What about faceting it instead?

.pull-left.w45[
```r
by_country |>
  ggplot(mapping = aes(x = donors_mean,
                       y = reorder(country, donors_mean),
                       color = consent_law)) +
  geom_point(size = 3) +
  guides(color = "none") +
* facet_wrap(~ consent_law, ncol = 1) +
  labs(x = "Donor Procurement Rate",
       y = NULL,
       color = "Consent Law")
```

.pull-left.w80[Restricting the layout to one column doesn't fix it: the panels still share a y-axis.]
]

--

.pull-right.w55[
<img src="05-slides_files/figure-html/codefig-consent2a-1.png" width="480" style="display: block; margin: auto;" />
]

---

# Allow the y-scale to vary

.pull-left.w45[
```r
by_country |>
  ggplot(mapping = aes(x = donors_mean,
                       y = reorder(country, donors_mean),
                       color = consent_law)) +
  geom_point(size = 3) +
  guides(color = "none") +
  facet_wrap(~ consent_law, ncol = 1,
*            scales = "free_y") +
  labs(x = "Donor Procurement Rate",
       y = NULL,
       color = "Consent Law")
```

.pull-left.w90[Normally the point of a facet is to preserve comparability between panels by not allowing the scales to vary. But for categorical measures it can be useful to allow this.]
]

--

.pull-right.w55[
<img src="05-slides_files/figure-html/codefig-consent3-1.png" width="768" style="display: block; margin: auto;" />
]

---

# Again, these methods are general

.pull-left.w50[
```r
by_country |>
  ggplot(mapping = aes(x = donors_mean,
                       y = reorder(country, donors_mean),
                       color = consent_law)) +
* geom_pointrange(mapping =
*                   aes(xmin = donors_mean - donors_sd,
*                       xmax = donors_mean + donors_sd)) +
  guides(color = "none") +
  facet_wrap(~ consent_law, ncol = 1, scales = "free_y") +
  labs(x = "Donor Procurement Rate",
       y = NULL,
       color = "Consent Law")
```
]

--

.pull-right.w50[
<img src="05-slides_files/figure-html/codefig-consent4-1.png" width="768" style="display: block; margin: auto;" />
]

---

# .kjh-green[`across()`] and .kjh-green[`where()`] again

```r
gss_sm |>
  select(madeg, padeg)
```

```
## # A tibble: 2,867 × 2
##    madeg          padeg
##    <fct>          <fct>
##  1 High School    Graduate
##  2 High School    Lt High School
##  3 Lt High School High School
##  4 High School    <NA>
##  5 High School    Bachelor
##  6 High School    <NA>
##  7 High School    High School
##  8 Lt High School Lt High School
##  9 Lt High School Lt High School
## 10 High School    High School
## # … with 2,857 more rows
```

---

# .kjh-green[`across()`] and .kjh-green[`where()`] again

```r
gss_sm |>
  group_by(sex, padeg) |>
  summarize(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE)))) |>
  select(sex, padeg, contains(c("age", "childs", "sibs")))
```

```
## # A tibble: 12 × 8
## # Groups:   sex [2]
##    sex    padeg          age_mean age_sd childs_mean childs_sd sibs_mean sibs_sd
##    <fct>  <fct>             <dbl>  <dbl>       <dbl>     <dbl>     <dbl>   <dbl>
##  1 Male   Lt High School     57.8   16.8        2.54      2.06      4.86    3.46
##  2 Male   High School        46.7   16.7        1.54      1.52      3.14    2.76
##  3 Male   Junior College     39.9   16.9        1.07      1.44      3.30    2.87
##  4 Male   Bachelor           43.3   14.6        1.27      1.35      2.54    2.41
##  5 Male   Graduate           39.9   14.8        1.01      1.35      2.36    1.88
##  6 Male   <NA>               47.1   17.1        1.75      1.67      3.84    3.21
##  7 Female Lt High School     58.5   18.0        2.46      1.72      4.74    3.43
##  8 Female High School        48.5   17.4        1.76      1.48      3.12    2.82
##  9 Female Junior College     39.2   11.6        1.46      1.43      3.19    2.00
## 10 Female Bachelor           44.8   15.4        1.32      1.35      2.88    2.62
## 11 Female Graduate           43.5   13.8        1.42      1.26      2.33    1.50
## 12 Female <NA>               47.4   17.8        2.08      1.69      4.65    3.93
```

---

# .kjh-green[`across()`] and .kjh-green[`where()`] again

```r
gss_sm |>
  select(padeg, madeg, contains(c("age", "childs", "sibs"))) |>
  group_by(padeg, madeg) |>
  summarize(across(where(is.numeric),
                   list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     sd = ~ sd(.x, na.rm = TRUE)
                   ))) |>
  drop_na() |>
  ggplot(mapping = aes(x = childs_mean,
                       xmin = childs_mean - childs_sd,
                       xmax = childs_mean + childs_sd,
                       y = madeg)) +
  geom_pointrange() +
  facet_wrap(~ padeg, ncol = 5)
```

<img src="05-slides_files/figure-html/unnamed-chunk-2-1.png" width="864" style="display: block; margin: auto;" />

---

# The .kjh-lblue[`nycdogs`] package

.pull-left.w60[
```r
library(nycdogs)
nyc_license
```

```
## # A tibble: 493,072 × 9
##    anima…¹ anima…² anima…³ breed…⁴ borough zip_c…⁵ license_…⁶ license_…⁷ extra…⁸
##    <chr>   <chr>     <dbl> <chr>   <chr>     <int> <date>     <date>       <dbl>
##  1 Paige   F          2014 Pit Bu… Manhat…   10035 2014-09-12 2017-09-12    2016
##  2 Yogi    M          2010 Boxer   Bronx     10465 2014-09-12 2017-10-02    2016
##  3 Ali     M          2014 Basenji Manhat…   10013 2014-09-12 2019-09-12    2016
##  4 Queen   F          2013 Akita … Manhat…   10013 2014-09-12 2017-09-12    2016
##  5 Lola    F          2009 Maltese Manhat…   10028 2014-09-12 2017-10-09    2016
##  6 Ian     M          2006 Unknown Manhat…   10013 2014-09-12 2019-10-30    2016
##  7 Buddy   M          2008 Unknown Manhat…   10025 2014-09-12 2017-10-20    2016
##  8 Chewba… F          2012 Labrad… Manhat…   10013 2014-09-12 2019-10-01    2016
##  9 Heidi-… F          2007 Dachsh… Brookl…   11215 2014-09-13 2017-04-16    2016
## 10 Massimo M          2009 Bull D… Brookl…   11201 2014-09-13 2017-09-17    2016
## # … with 493,062 more rows, and abbreviated variable names ¹animal_name,
## #   ²animal_gender, ³animal_birth_year, ⁴breed_rc, ⁵zip_code,
## #   ⁶license_issued_date, ⁷license_expired_date, ⁸extract_year
```
]

.pull-right.w40[
.center[]
]

---

# Dogs of New York

```r
nyc_license
```

```
## # A tibble: 493,072 × 9
##    anima…¹ anima…² anima…³ breed…⁴ borough zip_c…⁵ license_…⁶ license_…⁷ extra…⁸
##    <chr>   <chr>     <dbl> <chr>   <chr>     <int> <date>     <date>       <dbl>
##  1 Paige   F          2014 Pit Bu… Manhat…   10035 2014-09-12 2017-09-12    2016
##  2 Yogi    M          2010 Boxer   Bronx     10465 2014-09-12 2017-10-02    2016
##  3 Ali     M          2014 Basenji Manhat…   10013 2014-09-12 2019-09-12    2016
##  4 Queen   F          2013 Akita … Manhat…   10013 2014-09-12 2017-09-12    2016
##  5 Lola    F          2009 Maltese Manhat…   10028 2014-09-12 2017-10-09    2016
##  6 Ian     M          2006 Unknown Manhat…   10013 2014-09-12 2019-10-30    2016
##  7 Buddy   M          2008 Unknown Manhat…   10025 2014-09-12 2017-10-20    2016
##  8 Chewba… F          2012 Labrad… Manhat…   10013 2014-09-12 2019-10-01    2016
##  9 Heidi-… F          2007 Dachsh… Brookl…   11215 2014-09-13 2017-04-16    2016
## 10 Massimo M          2009 Bull D… Brookl…   11201 2014-09-13 2017-09-17    2016
## # … with 493,062 more rows, and abbreviated variable names ¹animal_name,
## #   ²animal_gender, ³animal_birth_year, ⁴breed_rc, ⁵zip_code,
## #   ⁶license_issued_date, ⁷license_expired_date, ⁸extract_year
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(borough) |>
  tally()
```

```
## # A tibble: 6 × 2
##   borough            n
##   <chr>          <int>
## 1 Bronx          51028
## 2 Brooklyn      125720
## 3 Manhattan     166849
## 4 Queens        101524
## 5 Staten Island  43236
## 6 <NA>            4715
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(borough) |>
  tally() |>
  arrange(n)
```

```
## # A tibble: 6 × 2
##   borough            n
##   <chr>          <int>
## 1 <NA>            4715
## 2 Staten Island  43236
## 3 Bronx          51028
## 4 Queens        101524
## 5 Brooklyn      125720
## 6 Manhattan     166849
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(borough) |>
  tally() |>
  arrange(desc(n))
```

```
## # A tibble: 6 × 2
##   borough            n
##   <chr>          <int>
## 1 Manhattan     166849
## 2 Brooklyn      125720
## 3 Queens        101524
## 4 Bronx          51028
## 5 Staten Island  43236
## 6 <NA>            4715
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(breed_rc) |>
  tally()
```

```
## # A tibble: 327 × 2
##    breed_rc                       n
##    <chr>                      <int>
##  1 Affenpinscher                136
##  2 Afghan Hound                  89
##  3 Afghan Hound Crossbreed       19
##  4 Airedale Terrier             227
##  5 Akita                        491
##  6 Akita Crossbreed             151
##  7 Alaskan Klee Kai             113
##  8 Alaskan Malamute             287
##  9 American Bully              1100
## 10 American English Coonhound   103
## # … with 317 more rows
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(breed_rc) |>
  tally() |>
  arrange(desc(n))
```

```
## # A tibble: 327 × 2
##    breed_rc                     n
##    <chr>                    <int>
##  1 Unknown                  54586
##  2 Yorkshire Terrier        30379
##  3 Labrador (or Crossbreed) 28399
##  4 Shih Tzu                 27407
##  5 Pit Bull (or Mix)        24393
##  6 Chihuahua                21211
##  7 Maltese                  15701
##  8 Pomeranian                9287
##  9 Havanese                  8606
## 10 Shih Tzu Crossbreed       8098
## # … with 317 more rows
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(breed_rc) |>
  tally() |>
  slice_max(order_by = n, n = 5)
```

```
## # A tibble: 5 × 2
##   breed_rc                     n
##   <chr>                    <int>
## 1 Unknown                  54586
## 2 Yorkshire Terrier        30379
## 3 Labrador (or Crossbreed) 28399
## 4 Shih Tzu                 27407
## 5 Pit Bull (or Mix)        24393
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(borough,
           breed_rc) |>
  drop_na() |>
  tally() |>
  slice_max(order_by = n, n = 5)
```

```
## # A tibble: 25 × 3
## # Groups:   borough [5]
##    borough  breed_rc                     n
##    <chr>    <chr>                    <int>
##  1 Bronx    Yorkshire Terrier         3583
##  2 Bronx    Pit Bull (or Mix)         3517
##  3 Bronx    Unknown                   3484
##  4 Bronx    Shih Tzu                  2970
##  5 Bronx    Chihuahua                 2224
##  6 Brooklyn Unknown                   9707
##  7 Brooklyn Yorkshire Terrier         5736
##  8 Brooklyn Pit Bull (or Mix)         5538
##  9 Brooklyn Shih Tzu                  5281
## 10 Brooklyn Labrador (or Crossbreed)  5179
## # … with 15 more rows
```

---

# .kjh-green[`arrange()`] and .kjh-green[`slice()`]

```r
nyc_license |>
  group_by(borough, breed_rc) |>
  drop_na() |>
  tally() |>
  slice_max(order_by = n, prop = 0.05)
```

```
## # A tibble: 64 × 3
## # Groups:   borough [5]
##    borough breed_rc                     n
##    <chr>   <chr>                    <int>
##  1 Bronx   Yorkshire Terrier         3583
##  2 Bronx   Pit Bull (or Mix)         3517
##  3 Bronx   Unknown                   3484
##  4 Bronx   Shih Tzu                  2970
##  5 Bronx   Chihuahua                 2224
##  6 Bronx   Maltese                   1382
##  7 Bronx   Labrador (or Crossbreed)  1340
##  8 Bronx   Shih Tzu Crossbreed        819
##  9 Bronx   Pomeranian                 667
## 10 Bronx   Chihuahua Crossbreed       638
## # … with 54 more rows
```
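
---

# .kjh-green[`slice_max()`] works within groups

.medium[A minimal sketch of why the last two slides return rows per borough rather than overall. `tally()` removes only the last grouping variable, so the counts are still grouped by `borough`, and `slice_max()` then picks rows inside each group. The `toy` tibble and its values are invented for illustration; they are not taken from `nycdogs`.]

```r
library(dplyr)
library(tibble)

# Hypothetical license records, invented for illustration (not from nycdogs)
toy <- tibble(
  borough = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  breed   = c("Pug", "Pug", "Pug", "Boxer", "Boxer", "Akita",
              "Akita", "Akita", "Boxer", "Pug")
)

# tally() peels off the last grouping variable (breed),
# so the counts remain grouped by borough
counts <- toy |>
  group_by(borough, breed) |>
  tally()

group_vars(counts)

# Because of that grouping, slice_max() keeps the top row per borough
top_breed <- counts |>
  slice_max(order_by = n, n = 1)

# Ungrouping first keeps the single largest count overall instead
top_overall <- counts |>
  ungroup() |>
  slice_max(order_by = n, n = 1)
```

.medium[The same logic applies to `prop`: on grouped data it is a share of each group's own rows, so groups with more rows contribute more rows to the result.]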