class: center middle main-title section-title-1 # .kjh-green[Iteration], .kjh-yellow[Missingness], and .kjh-lblue[Selection Effects] .class-info[ **Week 10** .light[Kieran Healy<br> Duke University, Spring 2023] ] --- layout: true class: title title-1 --- # Load the packages, as always .SMALL[ ```r library(here) # manage file paths library(socviz) # data and some useful functions library(tidyverse) # your friend and mine ``` ] --- layout: false class: center # .middle.squish4.huge[.kjh-orange[Moar Data]] --- layout: true class: title title-1 --- # More than one data file .SMALL[Say we have a local folder of data named `congress/`] .smaller[ ```r # A little trick from the fs package: fs::dir_tree(here("files", "data", "congress")) ``` ``` ## /Users/kjhealy/Documents/courses/vsd/files/data/congress ## ├── 01_79_congress.csv ## ├── 02_80_congress.csv ## ├── 03_81_congress.csv ## ├── 04_82_congress.csv ## ├── 05_83_congress.csv ## ├── 06_84_congress.csv ## ├── 07_85_congress.csv ## ├── 08_86_congress.csv ## ├── 09_87_congress.csv ## ├── 10_88_congress.csv ## ├── 11_89_congress.csv ## ├── 12_90_congress.csv ## ├── 13_91_congress.csv ## ├── 14_92_congress.csv ## ├── 15_93_congress.csv ## ├── 16_94_congress.csv ## ├── 17_95_congress.csv ## ├── 18_96_congress.csv ## ├── 19_97_congress.csv ## ├── 20_98_congress.csv ## ├── 21_99_congress.csv ## ├── 22_100_congress.csv ## ├── 23_101_congress.csv ## ├── 24_102_congress.csv ## ├── 25_103_congress.csv ## ├── 26_104_congress.csv ## ├── 27_105_congress.csv ## ├── 28_106_congress.csv ## ├── 29_107_congress.csv ## ├── 30_108_congress.csv ## ├── 31_109_congress.csv ## ├── 32_110_congress.csv ## ├── 33_111_congress.csv ## ├── 34_112_congress.csv ## ├── 35_113_congress.csv ## ├── 36_114_congress.csv ## ├── 37_115_congress.csv ## └── 38_116_congress.csv ``` ] --- # More than one data file .SMALL[Let's look at one.] ```r read_csv(here("files", "data", "congress", "17_95_congress.csv")) |> janitor::clean_names() |> head() ``` ``` ## # A tibble: 6 × 25 ## last first middle suffix nickname born death sex position party state ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Abdnor James <NA> <NA> <NA> 02/1… 11/0… M U.S. Re… Repu… SD ## 2 Abourezk James George <NA> <NA> 02/2… <NA> M U.S. Se… Demo… SD ## 3 Adams Brockm… <NA> <NA> Brock 01/1… 09/1… M U.S. Re… Demo… WA ## 4 Addabbo Joseph Patri… <NA> <NA> 03/1… 04/1… M U.S. Re… Demo… NY ## 5 Aiken George David <NA> <NA> 08/2… 11/1… M U.S. Se… Repu… VT ## 6 Akaka Daniel Kahik… <NA> <NA> 09/1… 04/0… M U.S. Re… Demo… HI ## # ℹ 14 more variables: district <chr>, start <chr>, end <chr>, religion <chr>, ## # race <chr>, educational_attainment <chr>, job_type1 <chr>, job_type2 <chr>, ## # job_type3 <chr>, job_type4 <chr>, job_type5 <lgl>, mil1 <chr>, mil2 <chr>, ## # mil3 <chr> ``` We often find ourselves in this situation. We know each file has the same structure, and we would like to use them all at once. --- # Loops? How to read them all in? One traditional way, which we could do in R, is to write an explicit _loop_ that iterated over a vector of filenames, read each file, and then joined the results together in a tall rectangle. ```r # Pseudocode filenames <- c("01_79_congress.csv", "02_80_congress.csv", "03_81_congress.csv", "04_82_congress.csv" [etc etc]) collected_files <- NULL for(i in 1:length(filenames)) { new_file <- read_file(filenames[i]) collected_files <- append_to(collected_files, new_files) } ``` --- # Loops? You may have noticed we have not written any loops, however. 
While loops are still lurking there underneath the surface, what we will do instead is to take advantage of the combination of vectors and functions and _map_ one to the other in order to generate results. Speaking loosely, think of .kjh-green[**`map()`**] as a way of .kjh-orange[iterating] without writing loops. You start with a vector of things. You feed it one thing at a time to some function. The function does whatever it does. You get back output that is the same length as your input, and of a specific type. --- # Mapping is just a kind of iteration The `purrr` package provides a big family of mapping functions. One reason there are a lot of them is that `purrr`, like the rest of the tidyverse, is picky about data types. -- So in addition to the basic .kjh-green[**`map()`**], which always returns a _list_, we also have .kjh-green[**`map_chr()`**], .kjh-green[**`map_int()`**], .kjh-green[**`map_dbl()`**], .kjh-green[**`map_lgl()`**] and others. They always return the data type indicated by their suffix, or die trying. --- # Vectorized arithmetic again The simplest cases are not that different from the vectorized arithmetic we're already familiar with. ```r a <- c(1:10) b <- 1 # You should know what R will do here a + b ``` ``` ## [1] 2 3 4 5 6 7 8 9 10 11 ``` -- R's vectorized rules add `b` to every element of `a`. In a sense, the .kjh-green[**`+`**] operation can be thought of as a function that takes each element of `a` and does something with it. In this case "add `b`". --- # Vectorized arithmetic again We can make this explicit by writing a function: ```r add_b <- function(x) { b <- 1 x + b # for any x } ``` Now: ```r add_b(x = a) ``` ``` ## [1] 2 3 4 5 6 7 8 9 10 11 ``` --- # Vectorized arithmetic again Again, R's vectorized approach means it automatically adds `b` to every element of the x we give it. ```r add_b(x = 10) ``` ``` ## [1] 11 ``` ```r add_b(x = c(1, 99, 1000)) ``` ``` ## [1] 2 100 1001 ``` --- # .kjh-green[Iterating] in a pipeline Some operations can't directly be vectorized in this way, which is why we need to manually iterate, or will want to write loops. ```r library(gapminder) gapminder |> summarize(country_n = n_distinct(country), continent_n = n_distinct(continent), year_n = n_distinct(year), lifeExp_n = n_distinct(lifeExp), population_n = n_distinct(population)) ``` ``` ## # A tibble: 1 × 5 ## country_n continent_n year_n lifeExp_n population_n ## <int> <int> <int> <int> <int> ## 1 142 5 12 1626 4060 ``` That's tedious to write! Computers are supposed to allow us to avoid that sort of thing. --- # .kjh-green[Iterating] in a pipeline So how would we iterate this? What we want is to apply the .kjh-green[**`n_distinct()`**] function to each column of `gapminder`, but in a way that still allows us to use pipelines and so on. ```r library(gapminder) gapminder |> summarize(n_distinct(country), n_distinct(continent), n_distinct(year), n_distinct(lifeExp), n_distinct(population)) ``` ``` ## # A tibble: 1 × 5 ## `n_distinct(country)` `n_distinct(continent)` `n_distinct(year)` ## <int> <int> <int> ## 1 142 5 12 ## # ℹ 2 more variables: `n_distinct(lifeExp)` <int>, ## # `n_distinct(population)` <int> ``` .smaller.kjh-darkgrey[Using .kjh-green[**`n_distinct()`**] in this context is an idea I got from Rebecca Barter's discussion of `purrr`.] 
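---

# .kjh-green[Iterating] in a pipeline

Before we get to the answer, one fact is worth keeping in mind: a data frame is just a list of its columns. Anything that can iterate over a list can therefore iterate over a tibble, column by column. A toy check (a sketch; nothing new is computed here):

```r
# A tibble is a list of columns under the hood
is.list(gapminder)  # TRUE
length(gapminder)   # the number of columns
names(gapminder)    # the column names
```

This is what lets us hand all of `gapminder` to functions that iterate over lists, as we are about to do.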
--- # .kjh-green[Iterating] in a pipeline You'd use .kjh-green[**across()**], like this: ```r gapminder |> summarize(across(everything(), n_distinct)) ``` ``` ## # A tibble: 1 × 6 ## country continent year lifeExp pop gdpPercap ## <int> <int> <int> <int> <int> <int> ## 1 142 5 12 1626 1704 1704 ``` --- # .kjh-green[Iterating] in a pipeline But you could also do this ... .pull-left[ ```r map(gapminder, n_distinct) ``` ``` ## $country ## [1] 142 ## ## $continent ## [1] 5 ## ## $year ## [1] 12 ## ## $lifeExp ## [1] 1626 ## ## $pop ## [1] 1704 ## ## $gdpPercap ## [1] 1704 ``` ] .pull-right[ Read it as "Feed each column of `gapminder` to the .kjh-green[**`n_distinct()`**] function. (This is pretty much what .kjh-green[**`across()`**] is doing more nicely.) ] --- # .kjh-green[Iterating] in a pipeline .pull-left[ Or, in pipeline form: ```r gapminder |> map(n_distinct) ``` ``` ## $country ## [1] 142 ## ## $continent ## [1] 5 ## ## $year ## [1] 12 ## ## $lifeExp ## [1] 1626 ## ## $pop ## [1] 1704 ## ## $gdpPercap ## [1] 1704 ``` ] .pull-right[ You can see we are getting a _list_ back. ] --- # .kjh-green[Iterating] in a pipeline Or, in pipeline form: ```r result <- gapminder |> map(n_distinct) class(result) ``` ``` ## [1] "list" ``` ```r result$continent ``` ``` ## [1] 5 ``` ```r result[[2]] ``` ``` ## [1] 5 ``` --- # .kjh-green[Iterating] in a pipeline But we know .kjh-green[**`n_distinct()`**] should always return an integer. So we use .kjh-green[**`map_int()`**] instead of the generic .kjh-green[**`map()`**]. ```r gapminder |> map_int(n_distinct) ``` ``` ## country continent year lifeExp pop gdpPercap ## 142 5 12 1626 1704 1704 ``` The thing about the .kjh-green[**`map()`**] family is that they can deal with all kinds of input types and output types. --- # Get a vector of .kjh-pink[filenames] ```r filenames <- dir(path = here("files", "data", "congress"), pattern = "*.csv", full.names = TRUE) filenames[1:15] # Just displaying the first 15, to save slide space ``` ``` ## [1] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/01_79_congress.csv" ## [2] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/02_80_congress.csv" ## [3] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/03_81_congress.csv" ## [4] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/04_82_congress.csv" ## [5] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/05_83_congress.csv" ## [6] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/06_84_congress.csv" ## [7] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/07_85_congress.csv" ## [8] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/08_86_congress.csv" ## [9] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/09_87_congress.csv" ## [10] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/10_88_congress.csv" ## [11] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/11_89_congress.csv" ## [12] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/12_90_congress.csv" ## [13] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/13_91_congress.csv" ## [14] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/14_92_congress.csv" ## [15] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/15_93_congress.csv" ``` --- # And feed it to .kjh-green[`read_csv()`] .small.squish3[... using the variant of .kjh-green[**`map()`**] that returns data frames and tibbles.] 
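.smaller[(That variant is .kjh-green[**`map_dfr()`**], used below. An aside: in purrr 1.0 and later it is superseded in favor of `map()` followed by `list_rbind()`; the superseded form still works fine, and a rough, not-run equivalent in the newer idiom is sketched here.)]

```r
## Sketch of the same operation with purrr >= 1.0 (assumes you have that
## version installed; the map_dfr() version below is what we actually run)
# df <- filenames |>
#   map(read_csv) |>
#   list_rbind(names_to = "congress") |>
#   janitor::clean_names()
```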
```r df <- filenames |> * map_dfr(read_csv, .id = "congress") |> janitor::clean_names() df ``` ``` ## # A tibble: 20,580 × 26 ## congress last first middle suffix nickname born death sex position party ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1 Abern… Thom… Gerst… <NA> <NA> 05/1… 01/2… M U.S. Re… Demo… ## 2 1 Adams Sher… <NA> <NA> <NA> 01/0… 10/2… M U.S. Re… Repu… ## 3 1 Aiken Geor… David <NA> <NA> 08/2… 11/1… M U.S. Se… Repu… ## 4 1 Allen Asa Leona… <NA> <NA> 01/0… 01/0… M U.S. Re… Demo… ## 5 1 Allen Leo Elwood <NA> <NA> 10/0… 01/1… M U.S. Re… Repu… ## 6 1 Almond J. Linds… Jr. <NA> 06/1… 04/1… M U.S. Re… Demo… ## 7 1 Ander… Herm… Carl <NA> <NA> 01/2… 07/2… M U.S. Re… Repu… ## 8 1 Ander… Clin… Presba <NA> <NA> 10/2… 11/1… M U.S. Re… Demo… ## 9 1 Ander… John Zuing… <NA> <NA> 03/2… 02/0… M U.S. Re… Repu… ## 10 1 Andre… Augu… Herman <NA> <NA> 10/1… 01/1… M U.S. Re… Repu… ## # ℹ 20,570 more rows ## # ℹ 15 more variables: state <chr>, district <chr>, start <chr>, end <chr>, ## # religion <chr>, race <chr>, educational_attainment <chr>, job_type1 <chr>, ## # job_type2 <chr>, job_type3 <chr>, job_type4 <chr>, job_type5 <chr>, ## # mil1 <chr>, mil2 <chr>, mil3 <chr> ``` --- layout: false class: center .top[] --- layout: true class: title title-1 --- # .kjh-green[`read_csv()`] can do this directly No `map_df()` required ```r tmp <- read_csv(filenames, id = "path", name_repair = janitor::make_clean_names) tmp |> mutate(congress = stringr::str_extract(path, "_\\d{2,3}_congress"), congress = stringr::str_extract(congress, "\\d{2,3}")) |> relocate(congress) ``` ``` ## # A tibble: 20,580 × 27 ## congress path last first middle suffix nickname born death sex position ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 79 /User… Aber… Thom… Gerst… <NA> <NA> 05/1… 01/2… M U.S. Re… ## 2 79 /User… Adams Sher… <NA> <NA> <NA> 01/0… 10/2… M U.S. Re… ## 3 79 /User… Aiken Geor… David <NA> <NA> 08/2… 11/1… M U.S. Se… ## 4 79 /User… Allen Asa Leona… <NA> <NA> 01/0… 01/0… M U.S. Re… ## 5 79 /User… Allen Leo Elwood <NA> <NA> 10/0… 01/1… M U.S. Re… ## 6 79 /User… Almo… J. Linds… Jr. <NA> 06/1… 04/1… M U.S. Re… ## 7 79 /User… Ande… Herm… Carl <NA> <NA> 01/2… 07/2… M U.S. Re… ## 8 79 /User… Ande… Clin… Presba <NA> <NA> 10/2… 11/1… M U.S. Re… ## 9 79 /User… Ande… John Zuing… <NA> <NA> 03/2… 02/0… M U.S. Re… ## 10 79 /User… Andr… Augu… Herman <NA> <NA> 10/1… 01/1… M U.S. Re… ## # ℹ 20,570 more rows ## # ℹ 16 more variables: party <chr>, state <chr>, district <chr>, start <chr>, ## # end <chr>, religion <chr>, race <chr>, educational_attainment <chr>, ## # job_type1 <chr>, job_type2 <chr>, job_type3 <chr>, job_type4 <chr>, ## # job_type5 <chr>, mil1 <chr>, mil2 <chr>, mil3 <chr> ``` --- # Example: Iterating on the .kjh-yellow[US Census] - But mapped iteration is not just for local files! 
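- Before the Census examples will run, `tidycensus` needs a (free) Census API key. A one-time setup sketch (the key string here is a placeholder, not a real key):

```r
## Hypothetical one-time setup: request a free key from the Census Bureau,
## then store it where tidycensus can find it in later sessions
# tidycensus::census_api_key("YOUR_KEY_HERE", install = TRUE)
```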
```r ## Remember to use your free Census API key library(tidycensus) ``` ```r out <- get_acs(geography = "county", variables = "B19013_001", state = "NY", county = "New York", survey = "acs1", year = 2005) ``` ```r out ``` ``` ## # A tibble: 1 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 36061 New York County, New York B19013_001 55973 1462 ``` --- # Example: Iterating on the .kjh-yellow[US Census] - All counties in New York State for a specific year ```r out <- get_acs(geography = "county", variables = "B19013_001", state = "NY", survey = "acs1", year = 2005) ``` ```r out ``` ``` ## # A tibble: 38 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 36001 Albany County, New York B19013_001 50054 2030 ## 2 36005 Bronx County, New York B19013_001 29228 853 ## 3 36007 Broome County, New York B19013_001 36394 2340 ## 4 36009 Cattaraugus County, New York B19013_001 37580 2282 ## 5 36011 Cayuga County, New York B19013_001 42057 2406 ## 6 36013 Chautauqua County, New York B19013_001 35495 2077 ## 7 36015 Chemung County, New York B19013_001 37418 3143 ## 8 36019 Clinton County, New York B19013_001 44757 3500 ## 9 36027 Dutchess County, New York B19013_001 61889 2431 ## 10 36029 Erie County, New York B19013_001 41967 1231 ## # ℹ 28 more rows ``` --- # Example: Iterating on the .kjh-yellow[US Census] - What if we want the results for _every_ available year? - First, a handy function: .kjh-green[**`set_names()`**] ```r x <- c(1:10) x ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 ``` ```r x <- set_names(x, nm = letters[1:10]) x ``` ``` ## a b c d e f g h i j ## 1 2 3 4 5 6 7 8 9 10 ``` --- # Example: Iterating on the .kjh-yellow[US Census] - By default, .kjh-green[**`set_names()`**] will label a vector with that vector’s values: ```r c(1:10) |> set_names() ``` ``` ## 1 2 3 4 5 6 7 8 9 10 ## 1 2 3 4 5 6 7 8 9 10 ``` --- # Example: Iterating on the .kjh-yellow[US Census] - Meanwhile, .kjh-green[**`map_dfr()`**] has an `.id` argument that lets you add a row-identifier to whatever you are binding. Like this: ```r df <- 2005:2019 |> map_dfr(~ get_acs(geography = "county", variables = "B19013_001", state = "NY", county = "New York", survey = "acs1", year = .x), .id = "id") ``` ```r df ``` ``` ## # A tibble: 15 × 6 ## id GEOID NAME variable estimate moe ## <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 1 36061 New York County, New York B19013_001 55973 1462 ## 2 2 36061 New York County, New York B19013_001 60017 1603 ## 3 3 36061 New York County, New York B19013_001 64217 2002 ## 4 4 36061 New York County, New York B19013_001 69017 1943 ## 5 5 36061 New York County, New York B19013_001 68706 1471 ## 6 6 36061 New York County, New York B19013_001 63832 2125 ## 7 7 36061 New York County, New York B19013_001 66299 1783 ## 8 8 36061 New York County, New York B19013_001 67099 1640 ## 9 9 36061 New York County, New York B19013_001 72190 2200 ## 10 10 36061 New York County, New York B19013_001 76089 2016 ## 11 11 36061 New York County, New York B19013_001 75575 2566 ## 12 12 36061 New York County, New York B19013_001 77559 2469 ## 13 13 36061 New York County, New York B19013_001 85071 3520 ## 14 14 36061 New York County, New York B19013_001 85066 3480 ## 15 15 36061 New York County, New York B19013_001 93651 3322 ``` --- # Example: Iterating on the .kjh-yellow[US Census] - Our `id` column tracks the year. But we’d like it to *be* the year. 
So, we use .kjh-green[**`set_names()`**], and also name it `year` when we create it: ```r df <- 2005:2019 |> set_names() |> map_dfr(~ get_acs(geography = "county", variables = "B19013_001", state = "NY", county = "New York", survey = "acs1", year = .x), .id = "year") |> mutate(year = as.integer(year)) ``` --- # Example: Iterating on the .kjh-yellow[US Census] ```r df ``` ``` ## # A tibble: 15 × 6 ## year GEOID NAME variable estimate moe ## <int> <chr> <chr> <chr> <dbl> <dbl> ## 1 2005 36061 New York County, New York B19013_001 55973 1462 ## 2 2006 36061 New York County, New York B19013_001 60017 1603 ## 3 2007 36061 New York County, New York B19013_001 64217 2002 ## 4 2008 36061 New York County, New York B19013_001 69017 1943 ## 5 2009 36061 New York County, New York B19013_001 68706 1471 ## 6 2010 36061 New York County, New York B19013_001 63832 2125 ## 7 2011 36061 New York County, New York B19013_001 66299 1783 ## 8 2012 36061 New York County, New York B19013_001 67099 1640 ## 9 2013 36061 New York County, New York B19013_001 72190 2200 ## 10 2014 36061 New York County, New York B19013_001 76089 2016 ## 11 2015 36061 New York County, New York B19013_001 75575 2566 ## 12 2016 36061 New York County, New York B19013_001 77559 2469 ## 13 2017 36061 New York County, New York B19013_001 85071 3520 ## 14 2018 36061 New York County, New York B19013_001 85066 3480 ## 15 2019 36061 New York County, New York B19013_001 93651 3322 ``` - Now `year` is just the year. The `year` column will be created as a character vector, so we converted it back to an integer again at the end. --- # Example: Iterating on the .kjh-yellow[US Census] ```r p_out <- 2005:2019 |> set_names() |> map_dfr(~ get_acs(geography = "county", variables = "B19013_001", state = "NY", survey = "acs1", year = .x), .id = "year") |> mutate(year = as.integer(year)) |> ggplot(mapping = aes(x = year, y = estimate, group = year)) + geom_boxplot(fill = "lightblue", alpha = 0.5, outlier.alpha = 0) + geom_jitter(position = position_jitter(width = 0.1), shape = 1) + scale_y_continuous(labels = scales::label_dollar()) + labs(x = "Year", y = "Dollars", title = "Median Household Income by County in New York State, 2005-2019", subtitle = "ACS 1-year estimates", caption = "Data: U.S. Census Bureau.") ``` --- # Example: Iterating on the .kjh-yellow[US Census] ```r print(p_out) ``` <img src="10-slides_files/figure-html/07-iterating-on-data-32-1.png" width="864" style="display: block; margin: auto;" /> --- layout: true class: title title-1 --- # Cleaning up .kjh-yellow[congress] ```r df <- filenames |> * map_dfr(read_csv, .id = "congress") |> janitor::clean_names() df |> select(born, death, start, end) ``` ``` ## # A tibble: 20,580 × 4 ## born death start end ## <chr> <chr> <chr> <chr> ## 1 05/16/1903 01/23/1953 01/03/1945 01/03/1953 ## 2 01/08/1899 10/27/1986 01/03/1945 01/03/1947 ## 3 08/20/1892 11/19/1984 01/03/1945 01/03/1979 ## 4 01/05/1891 01/05/1969 01/03/1945 01/03/1953 ## 5 10/05/1898 01/19/1973 01/03/1945 01/02/1949 ## 6 06/15/1898 04/14/1986 02/04/1946 04/17/1948 ## 7 01/27/1897 07/26/1978 01/03/1945 01/03/1963 ## 8 10/23/1895 11/11/1975 01/03/1941 06/30/1945 ## 9 03/22/1904 02/09/1981 01/03/1945 01/03/1953 ## 10 10/11/1890 01/14/1958 01/03/1945 01/14/1958 ## # ℹ 20,570 more rows ``` We'll use the **lubridate** package to sort these out. Lubridate has a wide range of functions to handle dates, times, and durations. ??? 
In particular it has many convenience functions to help with the many different ways that people encode dates that _ought_ to be encoded as `YYYY-MM-DD`. --- # Cleaning up .kjh-yellow[congress] ```r library(lubridate) date_recodes <- c("born", "death", "start", "end") df <- df |> mutate(across(any_of(date_recodes), mdy), congress = as.integer(congress) + 78) df ``` ``` ## # A tibble: 20,580 × 26 ## congress last first middle suffix nickname born death sex ## <dbl> <chr> <chr> <chr> <chr> <chr> <date> <date> <chr> ## 1 79 Abernethy Thomas Gerst… <NA> <NA> 1903-05-16 1953-01-23 M ## 2 79 Adams Sherman <NA> <NA> <NA> 1899-01-08 1986-10-27 M ## 3 79 Aiken George David <NA> <NA> 1892-08-20 1984-11-19 M ## 4 79 Allen Asa Leona… <NA> <NA> 1891-01-05 1969-01-05 M ## 5 79 Allen Leo Elwood <NA> <NA> 1898-10-05 1973-01-19 M ## 6 79 Almond J. Linds… Jr. <NA> 1898-06-15 1986-04-14 M ## 7 79 Andersen Herman Carl <NA> <NA> 1897-01-27 1978-07-26 M ## 8 79 Anderson Clinton Presba <NA> <NA> 1895-10-23 1975-11-11 M ## 9 79 Anderson John Zuing… <NA> <NA> 1904-03-22 1981-02-09 M ## 10 79 Andresen August Herman <NA> <NA> 1890-10-11 1958-01-14 M ## # ℹ 20,570 more rows ## # ℹ 17 more variables: position <chr>, party <chr>, state <chr>, ## # district <chr>, start <date>, end <date>, religion <chr>, race <chr>, ## # educational_attainment <chr>, job_type1 <chr>, job_type2 <chr>, ## # job_type3 <chr>, job_type4 <chr>, job_type5 <chr>, mil1 <chr>, mil2 <chr>, ## # mil3 <chr> ``` --- # Cleaning up .kjh-yellow[congress] ```r sessions <- tibble(congress = 79:116, start_year = seq(1945, 2019, by = 2), end_year = seq(1947, 2021, by = 2)) |> mutate(start_year = ymd(paste(start_year, "01", "03", sep = "-")), end_year = ymd(paste(end_year, "01", "03", sep = "-"))) sessions ``` ``` ## # A tibble: 38 × 3 ## congress start_year end_year ## <int> <date> <date> ## 1 79 1945-01-03 1947-01-03 ## 2 80 1947-01-03 1949-01-03 ## 3 81 1949-01-03 1951-01-03 ## 4 82 1951-01-03 1953-01-03 ## 5 83 1953-01-03 1955-01-03 ## 6 84 1955-01-03 1957-01-03 ## 7 85 1957-01-03 1959-01-03 ## 8 86 1959-01-03 1961-01-03 ## 9 87 1961-01-03 1963-01-03 ## 10 88 1963-01-03 1965-01-03 ## # ℹ 28 more rows ``` --- # We're going to join these tables .pull-left[ The big table .SMALL[ ```r df |> select(congress, last, born) ``` ``` ## # A tibble: 20,580 × 3 ## congress last born ## <dbl> <chr> <date> ## 1 79 Abernethy 1903-05-16 ## 2 79 Adams 1899-01-08 ## 3 79 Aiken 1892-08-20 ## 4 79 Allen 1891-01-05 ## 5 79 Allen 1898-10-05 ## 6 79 Almond 1898-06-15 ## 7 79 Andersen 1897-01-27 ## 8 79 Anderson 1895-10-23 ## 9 79 Anderson 1904-03-22 ## 10 79 Andresen 1890-10-11 ## # ℹ 20,570 more rows ``` ] ] .pull-right[ The smaller table .SMALL[ ```r sessions ``` ``` ## # A tibble: 38 × 3 ## congress start_year end_year ## <int> <date> <date> ## 1 79 1945-01-03 1947-01-03 ## 2 80 1947-01-03 1949-01-03 ## 3 81 1949-01-03 1951-01-03 ## 4 82 1951-01-03 1953-01-03 ## 5 83 1953-01-03 1955-01-03 ## 6 84 1955-01-03 1957-01-03 ## 7 85 1957-01-03 1959-01-03 ## 8 86 1959-01-03 1961-01-03 ## 9 87 1961-01-03 1963-01-03 ## 10 88 1963-01-03 1965-01-03 ## # ℹ 28 more rows ``` ] ] --- # We're going to .kjh-orange[join] these tables .SMALL.squish3[We will use .kjh-green[**`left_join()`**] which is what you want most of the time when you are looking to merge a smaller table with additional information into a larger main one. 
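Here is a toy sketch of the mechanics with made-up tables (not our data): every row of `x` is kept, the matching columns of `y` are added, and rows of `x` with no match in `y` get `NA`.

```r
x <- tribble(~id, ~val_x,
             1,   "a",
             2,   "b",
             3,   "c")

y <- tribble(~id, ~val_y,
             1,   "yes",
             2,   "no")

left_join(x, y, by = "id")  # id 3 is kept, with NA in val_y
```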
] .SMALL[ ```r df <- left_join(df, sessions) |> relocate(start_year:end_year, .after = congress) ``` ``` ## Joining with `by = join_by(congress)` ``` ```r df ``` ``` ## # A tibble: 20,580 × 28 ## congress start_year end_year last first middle suffix nickname born ## <dbl> <date> <date> <chr> <chr> <chr> <chr> <chr> <date> ## 1 79 1945-01-03 1947-01-03 Abern… Thom… Gerst… <NA> <NA> 1903-05-16 ## 2 79 1945-01-03 1947-01-03 Adams Sher… <NA> <NA> <NA> 1899-01-08 ## 3 79 1945-01-03 1947-01-03 Aiken Geor… David <NA> <NA> 1892-08-20 ## 4 79 1945-01-03 1947-01-03 Allen Asa Leona… <NA> <NA> 1891-01-05 ## 5 79 1945-01-03 1947-01-03 Allen Leo Elwood <NA> <NA> 1898-10-05 ## 6 79 1945-01-03 1947-01-03 Almond J. Linds… Jr. <NA> 1898-06-15 ## 7 79 1945-01-03 1947-01-03 Ander… Herm… Carl <NA> <NA> 1897-01-27 ## 8 79 1945-01-03 1947-01-03 Ander… Clin… Presba <NA> <NA> 1895-10-23 ## 9 79 1945-01-03 1947-01-03 Ander… John Zuing… <NA> <NA> 1904-03-22 ## 10 79 1945-01-03 1947-01-03 Andre… Augu… Herman <NA> <NA> 1890-10-11 ## # ℹ 20,570 more rows ## # ℹ 19 more variables: death <date>, sex <chr>, position <chr>, party <chr>, ## # state <chr>, district <chr>, start <date>, end <date>, religion <chr>, ## # race <chr>, educational_attainment <chr>, job_type1 <chr>, job_type2 <chr>, ## # job_type3 <chr>, job_type4 <chr>, job_type5 <chr>, mil1 <chr>, mil2 <chr>, ## # mil3 <chr> ``` ] --- # Table joins .footnote[*Spiffy Join Animatations courtesy [Garrick Aden-Buie](github.com/gadenbuie/join-animations-with-gganimate.R)] .top[] --- # Left join, .kjh-yellow[left_join()] .top[] .SMALL[All rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.] --- # Left join (contd), .kjh-yellow[left_join()] .top[] .SMALL[If there are multiple matches between x and y, all combinations of the matches are returned.] --- # Inner join, .kjh-yellow[inner_join()] .top[] .SMALL[All rows from x where there are matching values in y, and all columns from x and y.] --- # Full join, .kjh-yellow[full_join()] .top[] .SMALL[All rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.] --- # Semi join, .kjh-yellow[semi_join()] .top[] .SMALL[All rows from x where there are matching values in y, keeping just columns from x.] --- # Anti join, .kjh-yellow[anti_join()] .top[] .SMALL[All rows from x where there are not matching values in y, keeping just columns from x.] --- # Left join, .kjh-yellow[left_join()] Most of the time you will be looking to make a .kjh-green[**`left_join()`**] --- layout: false class: center # .middle.squish4.huge[.kjh-red[Missing Data]] --- layout: true class: title title-1 --- # Never test for missingness with .kjh-red[`==`] The result of almost any operation involving a missing/unknown value will be missing/unknown. ```r df <- tribble( ~subject, ~age, "A", 20, "B", 25, "C", NA, "D", 34 ) df ``` ``` ## # A tibble: 4 × 2 ## subject age ## <chr> <dbl> ## 1 A 20 ## 2 B 25 ## 3 C NA ## 4 D 34 ``` --- # Never test for missingness with .kjh-red[`==`] The result of almost any operation involving a missing/unknown value will be missing/unknown. ```r # OK df |> filter(age == 25) ``` ``` ## # A tibble: 1 × 2 ## subject age ## <chr> <dbl> ## 1 B 25 ``` -- ```r # Nope df |> filter(age == NA) ``` ``` ## # A tibble: 0 × 2 ## # ℹ 2 variables: subject <chr>, age <dbl> ``` ```r # E.g. 
23 == NA ``` ``` ## [1] NA ``` --- # Never test for missingness with .kjh-red[`==`] Always use .kjh-green[**`is.na()`**] instead ```r # Yes df |> filter(is.na(age)) ``` ``` ## # A tibble: 1 × 2 ## subject age ## <chr> <dbl> ## 1 C NA ``` --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r # install.packages() if they're not available library(naniar) library(visdat) organdata ``` ``` ## # A tibble: 238 × 21 ## country year donors pop pop_dens gdp gdp_lag health health_lag ## <chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl> ## 1 Australia NA NA 17065 0.220 16774 16591 1300 1224 ## 2 Australia 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300 ## 3 Australia 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379 ## 4 Australia 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455 ## 5 Australia 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540 ## 6 Australia 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626 ## 7 Australia 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737 ## 8 Australia 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846 ## 9 Australia 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948 ## 10 Australia 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077 ## # ℹ 228 more rows ## # ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>, ## # assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>, ## # consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr> ``` --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r gg_miss_var(organdata) ``` <img src="10-slides_files/figure-html/07-iterating-on-data-45-1.png" width="576" style="display: block; margin: auto;" /> --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r vis_dat(organdata) ``` <img src="10-slides_files/figure-html/07-iterating-on-data-46-1.png" width="576" style="display: block; margin: auto;" /> --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r miss_var_summary(organdata) ``` ``` ## # A tibble: 21 × 3 ## variable n_miss pct_miss ## <chr> <int> <dbl> ## 1 year 34 14.3 ## 2 donors 34 14.3 ## 3 opt 28 11.8 ## 4 pubhealth 21 8.82 ## 5 pop 17 7.14 ## 6 pop_dens 17 7.14 ## 7 gdp 17 7.14 ## 8 roads 17 7.14 ## 9 cerebvas 17 7.14 ## 10 assault 17 7.14 ## # ℹ 11 more rows ``` --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r miss_case_summary(organdata) ``` ``` ## # A tibble: 238 × 3 ## case n_miss pct_miss ## <int> <int> <dbl> ## 1 84 12 57.1 ## 2 182 12 57.1 ## 3 210 12 57.1 ## 4 14 11 52.4 ## 5 28 11 52.4 ## 6 42 11 52.4 ## 7 56 11 52.4 ## 8 70 11 52.4 ## 9 98 11 52.4 ## 10 112 11 52.4 ## # ℹ 228 more rows ``` --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r organdata |> select(consent_law, year, pubhealth, roads) |> group_by(consent_law) |> miss_var_summary() ``` ``` ## # A tibble: 6 × 4 ## # Groups: consent_law [2] ## consent_law variable n_miss pct_miss ## <chr> <chr> <int> <dbl> ## 1 Informed year 16 14.3 ## 2 Informed pubhealth 8 7.14 ## 3 Informed roads 8 7.14 ## 4 Presumed year 18 14.3 ## 5 Presumed pubhealth 13 10.3 ## 6 Presumed roads 9 7.14 ``` --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r vis_miss(organdata) ``` <img src="10-slides_files/figure-html/07-iterating-on-data-50-1.png" width="576" style="display: block; margin: auto;" /> --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r vis_miss(organdata, cluster = TRUE) ``` <img src="10-slides_files/figure-html/07-iterating-on-data-52-1.png" width="576" style="display: block; margin: auto;" /> --- # Using .kjh-yellow[naniar] and .kjh-yellow[visdat] ```r 
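## gg_miss_upset() shows the most common combinations of missing values together;
## it relies on the UpSetR package under the hood, so install that as well if needed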
gg_miss_upset(organdata) ``` <img src="10-slides_files/figure-html/07-iterating-on-data-53-1.png" width="576" style="display: block; margin: auto;" /> --- # .kjh-yellow[Upset plots] and a bit of wrangling  --- # .kjh-yellow[Upset plots] and a bit of wrangling ```r symptoms <- c("Anosmia", "Cough", "Fatigue", "Diarrhea", "Breath", "Fever") names(symptoms) <- symptoms symptoms ``` ``` ## Anosmia Cough Fatigue Diarrhea Breath Fever ## "Anosmia" "Cough" "Fatigue" "Diarrhea" "Breath" "Fever" ``` --- # .kjh-yellow[Upset plots] and a bit of wrangling .SMALL[ ```r # An Excel file! dat <- readxl::read_xlsx(here("files", "data", "symptoms.xlsx")) dat |> print(n = nrow(dat)) ``` ``` ## # A tibble: 32 × 2 ## combination count ## <chr> <dbl> ## 1 Anosmia 140 ## 2 Cough 57 ## 3 Fatigue 198 ## 4 Diarrhea 12 ## 5 Breath 5 ## 6 Fever 11 ## 7 Cough&Fatigue 179 ## 8 Fatigue&Fever 28 ## 9 Breath&Fatigue 10 ## 10 Diarrhea&Fatigue 43 ## 11 Anosmia&Fatigue 281 ## 12 Breath&Cough 1 ## 13 Anosmia&Diarrhea&Fatigue 64 ## 14 Breath&Cough&Fatigue 22 ## 15 Anosmia&Cough&Fatigue 259 ## 16 Anosmia&Fever&Fatigue 46 ## 17 Cough&Fever&Fatigue 54 ## 18 Cough&Diarrhea 7 ## 19 Cough&Diarrhea&Fatigue 31 ## 20 Anosmia&Breath&Cough&Fatigue 26 ## 21 Anosmia&Cough&Fatigue&Fever 69 ## 22 Anosmia&Breath&Cough&Diarrhea&Fatigue 18 ## 23 Anosmia&Breath&Cough&Fatigue&Fever 17 ## 24 Breath&Cough&Fatigue&Fever 11 ## 25 Breath&Cough&Diarrhea&Fatigue 7 ## 26 Breath&Cough&Diarrhea&Fatigue&Fever 8 ## 27 Diarrhea&Fatigue&Fever 12 ## 28 Cough&Diarrhea&Fatigue&Fever 17 ## 29 Anosmia&Diarrhea&Fatigue&Fever 17 ## 30 Anosmia&Diarrhea&Cough&Fatigue 41 ## 31 Anosmia&Breath&Cough&Diarrhea&Fatigue&Fever 23 ## 32 Anosmia&Cough&Diarrhea&Fatigue&Fever 50 ``` ] --- # .kjh-yellow[Upset plots] and a bit of wrangling .SMALL[ ```r subsets <- dat |> pull(combination) ## Check if each subset mentions each symptom or not symptom_mat <- map_dfc(subsets, str_detect, symptoms) |> data.frame() |> t() |> # transpose the result, this is a little gross, sorry as_tibble(.name_repair = "unique") colnames(symptom_mat) <- symptoms symptom_mat$count <- dat$count ``` ] --- # .kjh-yellow[Upset plots] and a bit of wrangling Now we have a table we can do something with. 
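(An aside: one way to build the same indicator table without the transpose trick is to lengthen and then widen the data with tidyr. A sketch, assuming the same `dat` as above; `symptom_mat2` is just an illustrative name.)

```r
## Alternative sketch: one row per symptom mention, then pivot back out
## to logical indicator columns
symptom_mat2 <- dat |>
  mutate(id = row_number()) |>
  separate_rows(combination, sep = "&") |>
  mutate(present = TRUE) |>
  pivot_wider(names_from = combination, values_from = present,
              values_fill = FALSE) |>
  select(-id) |>
  relocate(count, .after = last_col())
```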
.SMALL[ ```r symptom_mat |> print(n = nrow(symptom_mat)) ``` ``` ## # A tibble: 32 × 7 ## Anosmia Cough Fatigue Diarrhea Breath Fever count ## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <dbl> ## 1 TRUE FALSE FALSE FALSE FALSE FALSE 140 ## 2 FALSE TRUE FALSE FALSE FALSE FALSE 57 ## 3 FALSE FALSE TRUE FALSE FALSE FALSE 198 ## 4 FALSE FALSE FALSE TRUE FALSE FALSE 12 ## 5 FALSE FALSE FALSE FALSE TRUE FALSE 5 ## 6 FALSE FALSE FALSE FALSE FALSE TRUE 11 ## 7 FALSE TRUE TRUE FALSE FALSE FALSE 179 ## 8 FALSE FALSE TRUE FALSE FALSE TRUE 28 ## 9 FALSE FALSE TRUE FALSE TRUE FALSE 10 ## 10 FALSE FALSE TRUE TRUE FALSE FALSE 43 ## 11 TRUE FALSE TRUE FALSE FALSE FALSE 281 ## 12 FALSE TRUE FALSE FALSE TRUE FALSE 1 ## 13 TRUE FALSE TRUE TRUE FALSE FALSE 64 ## 14 FALSE TRUE TRUE FALSE TRUE FALSE 22 ## 15 TRUE TRUE TRUE FALSE FALSE FALSE 259 ## 16 TRUE FALSE TRUE FALSE FALSE TRUE 46 ## 17 FALSE TRUE TRUE FALSE FALSE TRUE 54 ## 18 FALSE TRUE FALSE TRUE FALSE FALSE 7 ## 19 FALSE TRUE TRUE TRUE FALSE FALSE 31 ## 20 TRUE TRUE TRUE FALSE TRUE FALSE 26 ## 21 TRUE TRUE TRUE FALSE FALSE TRUE 69 ## 22 TRUE TRUE TRUE TRUE TRUE FALSE 18 ## 23 TRUE TRUE TRUE FALSE TRUE TRUE 17 ## 24 FALSE TRUE TRUE FALSE TRUE TRUE 11 ## 25 FALSE TRUE TRUE TRUE TRUE FALSE 7 ## 26 FALSE TRUE TRUE TRUE TRUE TRUE 8 ## 27 FALSE FALSE TRUE TRUE FALSE TRUE 12 ## 28 FALSE TRUE TRUE TRUE FALSE TRUE 17 ## 29 TRUE FALSE TRUE TRUE FALSE TRUE 17 ## 30 TRUE TRUE TRUE TRUE FALSE FALSE 41 ## 31 TRUE TRUE TRUE TRUE TRUE TRUE 23 ## 32 TRUE TRUE TRUE TRUE FALSE TRUE 50 ``` ] --- # .kjh-yellow[Upset plots] and a bit of wrangling Uncounting tables .SMALL[ ```r indvs <- symptom_mat |> uncount(count) indvs ``` ``` ## # A tibble: 1,764 × 6 ## Anosmia Cough Fatigue Diarrhea Breath Fever ## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> ## 1 TRUE FALSE FALSE FALSE FALSE FALSE ## 2 TRUE FALSE FALSE FALSE FALSE FALSE ## 3 TRUE FALSE FALSE FALSE FALSE FALSE ## 4 TRUE FALSE FALSE FALSE FALSE FALSE ## 5 TRUE FALSE FALSE FALSE FALSE FALSE ## 6 TRUE FALSE FALSE FALSE FALSE FALSE ## 7 TRUE FALSE FALSE FALSE FALSE FALSE ## 8 TRUE FALSE FALSE FALSE FALSE FALSE ## 9 TRUE FALSE FALSE FALSE FALSE FALSE ## 10 TRUE FALSE FALSE FALSE FALSE FALSE ## # ℹ 1,754 more rows ``` ] Now we've reconstructed the individual-level observations. --- # .kjh-yellow[Upset plots] and a bit of wrangling ```r # devtools::install_github("krassowski/complex-upset") library(ComplexUpset) upset(data = indvs, intersect = symptoms, name="Symptom Groupings by Frequency. Total pool is 1,764 individuals.", min_size = 0, width_ratio = 0.125) + labs(title = "Co-Occurence of COVID-19 Symptoms", caption = "Data: covid.joinzoe.com/us | Graph: @kjhealy") ``` --- # .kjh-yellow[Upset plots] and a bit of wrangling <img src="10-slides_files/figure-html/07-iterating-on-data-60-1.png" width="864" style="display: block; margin: auto;" /> --- layout: true class: title title-1 --- # Selection Effects -- ## Missingness is closely related to Selection Effects and Survivorship bias -- ## Both are directly connected to the question of making good inferences from data --- layout: false class: center .top[] --- layout: true class: title title-1 --- # Sampling on the dependent variable ### If you're looking to see what causes some outcome, you can't establish the causes if you only collect data for cases where the outcome occurred. 
-- ### This seems straightforward for simple cases but can be quite subtle in practice -- ### In particular it can be hard to be sure that the data-generating process isn't subtly biased ("Selection vs Treatment") -- ### And if unobserved pathways to the outcome are associated with the observed ones, you can get spurious associations --- # Example .top[] --- # Example .top[] --- # Example .top[] --- layout: false class: center # .middle.squish4.huge[.kjh-red[Problems with this approach]] --- layout: true class: title title-1 --- # Causal Paths ### Simplest version: You're only observing people _after_ the GRE selection criterion has been applied -- ### More likely: There's more than one way to be admitted; and no-one is admitted without some expectation of success -- ### The non-admitted people are not observed; and amongst those who are only one admissions signal (the GRE) is observed -- ### Conditional on being good enough to be admitted, GRE will not predict "success". --- # Example: Success in the NBA ```r # devtools::install_github("sportsdataverse/hoopR") df <- read_csv(here("files", "data", "nba_players_2023_04_03.csv"), show_col_types = FALSE) num_vars <- c("age", "player_height_inches", "player_weight", colnames(df)[14:23]) df <- df |> mutate(across(all_of(num_vars), ~ as.numeric(.x))) ``` --- # Example: Success in the NBA ```r df |> select(player_name, team_abbreviation, player_height_inches, gp, pts) ``` ``` ## # A tibble: 528 × 5 ## player_name team_abbreviation player_height_inches gp pts ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 A.J. Lawson DAL 78 13 34 ## 2 AJ Green MIL 77 33 148 ## 3 AJ Griffin ATL 78 68 600 ## 4 Aaron Gordon DEN 80 65 1068 ## 5 Aaron Holiday ATL 72 59 232 ## 6 Aaron Nesmith IND 77 69 688 ## 7 Aaron Wiggins OKC 77 66 433 ## 8 Admiral Schofield ORL 77 33 135 ## 9 Al Horford BOS 81 61 594 ## 10 Alec Burks DET 78 51 652 ## # ℹ 518 more rows ``` --- # Example: Success in the NBA ```r df |> ggplot(aes(x = player_height_inches, y = pts/gp)) + geom_jitter(position = position_jitter(height = 0, width = 0.2)) + geom_smooth(se = FALSE, method = "lm") + labs(x = "Player height (inches)", y = "Points per game", title = "Player height and point scoring in the NBA", subtitle = "Data are for all current NBA players") ``` <img src="10-slides_files/figure-html/unnamed-chunk-3-1.png" width="720" style="display: block; margin: auto;" /> --- # Example: Success in the NBA ```r out <- df |> # Exclude two outliers with v. low ratings due to no game time filter(net_rating > -100) |> mutate(ppg = pts/gp) |> select(player_id, player_name, player_height_inches, net_rating, ppg, ts_pct) |> pivot_longer(cols = net_rating:ts_pct) |> mutate(name = case_match(name, "net_rating" ~ "Net Rating", "ppg" ~ "Points per Game", "ts_pct" ~ "True Shooting Percentage")) |> ggplot(aes(x = player_height_inches, y = value)) + geom_jitter(position = position_jitter(width = 0.1, height = 0), pch = 1) + geom_smooth(se = FALSE, method = "lm") + facet_wrap(~ name, nrow = 1, scales = "free_y") + labs(x = "Player height (inches)", y = "Value", title = "Player height and performance in the NBA", subtitle = "Data are for current NBA players", caption = "Two players with almost no game time are excluded." 
)
```

---

# Example: Success in the NBA

```r
print(out)
```

<img src="10-slides_files/figure-html/unnamed-chunk-5-1.png" width="864" style="display: block; margin: auto;" />

---

# Imaginary Hollywood Example

```r
hw <- tibble(looks = rnorm(1e5),
             talent = rnorm(1e5),
             total = looks + talent,
             hollywood = total > quantile(total, 0.98))

out <- hw |>
  ggplot(mapping = aes(x = looks, y = talent)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  labs(title = "The Pool")
```

---

# Imaginary Hollywood Example

```r
print(out)
```

<img src="10-slides_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---

# Imaginary Hollywood Example

```r
hw_obs <- hw |>
  filter(hollywood == TRUE) |>
  slice_sample(prop = 0.2)

out_obs <- hw_obs |>
  ggplot(mapping = aes(x = looks, y = talent)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Observed (in the movies)")
```

---

# Imaginary Hollywood Example

```r
print(out_obs)
```

<img src="10-slides_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" />

---

# Why this happens in the example

```r
out <- hw |>
  ggplot(mapping = aes(x = looks, y = talent, color = hollywood)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")
```

---

# Why this happens in the example

```r
print(out)
```

<img src="10-slides_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" />
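---

# Why this happens in the example

A quick numerical check of the same point, using the simulated `hw` data from above (a sketch, not a formal analysis): the slope of talent on looks is roughly zero in the full pool, but turns negative once we condition on being selected into Hollywood.

```r
# Slope of talent on looks: full pool vs. only those selected into Hollywood
coef(lm(talent ~ looks, data = hw))["looks"]
coef(lm(talent ~ looks, data = filter(hw, hollywood)))["looks"]
```

Selecting on the sum of looks and talent makes the two negatively associated among the people we actually observe, even though they are unrelated in the population they were drawn from.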