library(here) # manage file paths
library(socviz) # data and some useful functions
library(tidyverse) # your friend and mine
library(gapminder) # some data
January 31, 2024
Tidy data
Tidy data is in long format
Grolemund & Wickham
Very, very often, the solution to some data-wrangling or data visualization problem in a Tidyverse-focused workflow is:
First, get the data into long format
Then do the thing you want.
Storing and printing data in long format entails a lot of repetition:
# penguins (with bill_length_mm) comes from the palmerpenguins package
penguins |>
  group_by(species, island, year) |>
  summarize(bill = round(mean(bill_length_mm, na.rm = TRUE), 2))
species | island | year | bill |
Adelie | Biscoe | 2007 | 38.32 |
Adelie | Biscoe | 2008 | 38.70 |
Adelie | Biscoe | 2009 | 39.69 |
Adelie | Dream | 2007 | 39.10 |
Adelie | Dream | 2008 | 38.19 |
Adelie | Dream | 2009 | 38.15 |
Adelie | Torgersen | 2007 | 38.80 |
Adelie | Torgersen | 2008 | 38.77 |
Adelie | Torgersen | 2009 | 39.31 |
Chinstrap | Dream | 2007 | 48.72 |
Chinstrap | Dream | 2008 | 48.70 |
Chinstrap | Dream | 2009 | 49.05 |
Gentoo | Biscoe | 2007 | 47.01 |
Gentoo | Biscoe | 2008 | 46.94 |
Gentoo | Biscoe | 2009 | 48.50 |
A wide format is easier and more efficient to read in print:
penguins |>
  group_by(species, island, year) |>
  summarize(bill = round(mean(bill_length_mm, na.rm = TRUE), 2)) |>
  pivot_wider(names_from = year, values_from = bill)
species | island | 2007 | 2008 | 2009 |
Adelie | Biscoe | 38.32 | 38.70 | 39.69 |
Adelie | Dream | 39.10 | 38.19 | 38.15 |
Adelie | Torgersen | 38.80 | 38.77 | 39.31 |
Chinstrap | Dream | 48.72 | 48.70 | 49.05 |
Gentoo | Biscoe | 47.01 | 46.94 | 48.50 |
The same summary can be pivoted wider on island instead of year:
penguins |>
  group_by(species, year, island) |>
  summarize(bill = round(mean(bill_length_mm, na.rm = TRUE), 2)) |>
  pivot_wider(names_from = island, values_from = bill)
species | year | Biscoe | Dream | Torgersen |
Adelie | 2007 | 38.32 | 39.10 | 38.80 |
Adelie | 2008 | 38.70 | 38.19 | 38.77 |
Adelie | 2009 | 39.69 | 38.15 | 39.31 |
Chinstrap | 2007 | NA | 48.72 | NA |
Chinstrap | 2008 | NA | 48.70 | NA |
Chinstrap | 2009 | NA | 49.05 | NA |
Gentoo | 2007 | 47.01 | NA | NA |
Gentoo | 2008 | 46.94 | NA | NA |
Gentoo | 2009 | 48.50 | NA | NA |
Spot the untidiness
Prevention is better than cure!
An excellent article by Karl Broman and Kara Woo:
Data organization in spreadsheets
Pivoting from wide to long:
# A tibble: 366 × 11
age sex year total elem4 elem8 hs3 hs4 coll3 coll4 median
<chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25-34 Male 2016 21845 116 468 1427 6386 6015 7432 NA
2 25-34 Male 2015 21427 166 488 1584 6198 5920 7071 NA
3 25-34 Male 2014 21217 151 512 1611 6323 5910 6710 NA
4 25-34 Male 2013 20816 161 582 1747 6058 5749 6519 NA
5 25-34 Male 2012 20464 161 579 1707 6127 5619 6270 NA
6 25-34 Male 2011 20985 190 657 1791 6444 5750 6151 NA
7 25-34 Male 2010 20689 186 641 1866 6458 5587 5951 NA
8 25-34 Male 2009 20440 184 695 1806 6495 5508 5752 NA
9 25-34 Male 2008 20210 172 714 1874 6356 5277 5816 NA
10 25-34 Male 2007 20024 246 757 1930 6361 5137 5593 NA
# ℹ 356 more rows
Here, a “Level of Schooling Attained” variable is spread across the columns, from elem4 to coll4. We need a key column called “education” with the various levels of schooling, and a corresponding value column containing the counts.
We’re going to put the columns elem4:coll4 into a new column, creating a new categorical measure named education. The numbers currently under each column will become a new value column corresponding to that level of education.
# A tibble: 2,196 × 7
age sex year total median education value
<chr> <chr> <int> <int> <dbl> <chr> <dbl>
1 25-34 Male 2016 21845 NA elem4 116
2 25-34 Male 2016 21845 NA elem8 468
3 25-34 Male 2016 21845 NA hs3 1427
4 25-34 Male 2016 21845 NA hs4 6386
5 25-34 Male 2016 21845 NA coll3 6015
6 25-34 Male 2016 21845 NA coll4 7432
7 25-34 Male 2015 21427 NA elem4 166
8 25-34 Male 2015 21427 NA elem8 488
9 25-34 Male 2015 21427 NA hs3 1584
10 25-34 Male 2015 21427 NA hs4 6198
# ℹ 2,186 more rows
We can name the value column whatever we like. Here it’s a number of people, so we call it n.
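As a sketch, the renaming can be done with the values_to argument of pivot_longer(). The one-row table edu_row is a hypothetical stand-in copied from the first row of the printed output.

```r
library(tibble)
library(tidyr)

# One row copied from the printed table; edu_row is a hypothetical name
edu_row <- tibble(age = "25-34", sex = "Male", year = 2016,
                  elem4 = 116, elem8 = 468, hs3 = 1427,
                  hs4 = 6386, coll3 = 6015, coll4 = 7432)

# values_to names the value column; here the values are counts of people
edu_n <- edu_row |>
  pivot_longer(elem4:coll4, names_to = "education", values_to = "n")

edu_n
```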
# A tibble: 2,196 × 7
age sex year total median education n
<chr> <chr> <int> <int> <dbl> <chr> <dbl>
1 25-34 Male 2016 21845 NA elem4 116
2 25-34 Male 2016 21845 NA elem8 468
3 25-34 Male 2016 21845 NA hs3 1427
4 25-34 Male 2016 21845 NA hs4 6386
5 25-34 Male 2016 21845 NA coll3 6015
6 25-34 Male 2016 21845 NA coll4 7432
7 25-34 Male 2015 21427 NA elem4 166
8 25-34 Male 2015 21427 NA elem8 488
9 25-34 Male 2015 21427 NA hs3 1584
10 25-34 Male 2015 21427 NA hs4 6198
# ℹ 2,186 more rows
# This is not portable!
df <- read_csv("/Users/kjhealy/Documents/data/misc/project/data/mydata.csv")
The here package, and its here() function, build paths relative to the top level of your R project.
This seminar’s files all live in an RStudio project. It looks like this:
├── 00_dummy_files
├── R
├── README.qmd
├── _extensions
├── _freeze
├── _quarto.yml
├── _site
├── _targets
├── _targets.R
├── _variables.yml
├── about
├── assignment
├── content
├── data
├── example
├── files
├── grades
├── html
├── images
├── index.html
├── index.qmd
├── merm.txt
├── projects
├── renv
├── renv.lock
├── schedule
├── site_libs
├── slides
├── staging
├── syllabus
└── vsd.Rproj
I want to load files from the data folder, but I also want you to be able to load them. I’m writing this from somewhere deep in the slides folder, but you won’t be there.
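A small sketch of how here() helps in this situation. The file name mydata.csv is a hypothetical placeholder, not necessarily a file in the seminar's data folder.

```r
library(here)

# Build a path from the project root, regardless of the current folder.
# "mydata.csv" is a hypothetical file name used only for illustration.
data_path <- here("data", "mydata.csv")
data_path

# With a real file in place, this would read it from anywhere in the project:
# df <- readr::read_csv(data_path)
```

Because the path is built from the project's top level, the same line works whether it is run from the slides folder or anywhere else inside the project.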
# A tibble: 238 × 21
country year donors pop pop.dens gdp gdp.lag health health.lag pubhealth
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austra… NA NA 17065 0.220 16774 16591 1300 1224 4.8
2 Austra… 1991 12.1 17284 0.223 17171 16774 1379 1300 5.4
3 Austra… 1992 12.4 17495 0.226 17914 17171 1455 1379 5.4
4 Austra… 1993 12.5 17667 0.228 18883 17914 1540 1455 5.4
5 Austra… 1994 10.2 17855 0.231 19849 18883 1626 1540 5.4
6 Austra… 1995 10.2 18072 0.233 21079 19849 1737 1626 5.5
7 Austra… 1996 10.6 18311 0.237 21923 21079 1846 1737 5.6
8 Austra… 1997 10.3 18518 0.239 22961 21923 1948 1846 5.7
9 Austra… 1998 10.5 18711 0.242 24148 22961 2077 1948 5.9
10 Austra… 1999 8.67 18926 0.244 25445 24148 2231 2077 6.1
# ℹ 228 more rows
# ℹ 11 more variables: roads <dbl>, cerebvas <dbl>, assault <dbl>,
# external <dbl>, txp.pop <dbl>, world <chr>, opt <chr>, <chr>,
# consent.practice <chr>, consistent <chr>, ccode <chr>
And there it is.
read_csv() has variants
read_csv(): field separator is a comma (,)
read_csv2(): field separator is a semicolon (;)
Both are special cases of read_delim()
read_tsv(): tab separated.
read_fwf(): fixed-width files.
read_log(): log files (i.e. computer log files).
read_lines(): just read in lines, without trying to parse them.
read_table(): for data that’s separated by one (or more) columns of space.
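To illustrate the "special case" relationship, here is a minimal sketch that reads the same literal text with read_csv() and with read_delim(). The inline text is made-up illustration data, and wrapping it in I() assumes readr 2.0 or later.

```r
library(readr)

# Literal data wrapped in I(), so no file is needed (assumes readr >= 2.0)
csv_text <- I("country,year\nAustralia,1991\nAustralia,1992\n")

from_csv   <- read_csv(csv_text, show_col_types = FALSE)
from_delim <- read_delim(csv_text, delim = ",", show_col_types = FALSE)

# Both calls should parse the columns the same way
identical(from_csv$year, from_delim$year)
```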
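The haven package provides functions for reading files from other statistical software, including SAS Transport files.
Make these functions available with library(haven).
Compressed files (e.g., .tar.gz) will be automatically uncompressed.
# A tibble: 238 × 21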
country year donors pop pop.dens gdp gdp.lag health health.lag pubhealth
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austra… NA NA 17065 0.220 16774 16591 1300 1224 4.8
2 Austra… 1991 12.1 17284 0.223 17171 16774 1379 1300 5.4
3 Austra… 1992 12.4 17495 0.226 17914 17171 1455 1379 5.4
4 Austra… 1993 12.5 17667 0.228 18883 17914 1540 1455 5.4
5 Austra… 1994 10.2 17855 0.231 19849 18883 1626 1540 5.4
6 Austra… 1995 10.2 18072 0.233 21079 19849 1737 1626 5.5
7 Austra… 1996 10.6 18311 0.237 21923 21079 1846 1737 5.6
8 Austra… 1997 10.3 18518 0.239 22961 21923 1948 1846 5.7
9 Austra… 1998 10.5 18711 0.242 24148 22961 2077 1948 5.9
10 Austra… 1999 8.67 18926 0.244 25445 24148 2231 2077 6.1
# ℹ 228 more rows
# ℹ 11 more variables: roads <dbl>, cerebvas <dbl>, assault <dbl>,
# external <dbl>, txp.pop <dbl>, world <chr>, opt <chr>, <chr>,
# consent.practice <chr>, consistent <chr>, ccode <chr>
How does ggplot do this?
ggplot’s flow of action
What we start with
Where we’re going
Core steps
Optional steps
ggplot’s flow of action: required
Tidy data
Aesthetic mappings
Geom
Let’s go piece by piece
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
The data is the gapminder dataset.
Tell ggplot the variables you want represented by visual elements on the plot.
The mapping call links variables to things you will see on the plot.
x and y represent the quantities determining position on the x and y axes.
Other aesthetic mappings can include, e.g., color, shape, size, and fill.
Mappings do not directly specify the particular, e.g., colors, shapes, or line styles that will appear on the plot. Rather, they establish which variables in the data will be represented by which visible elements on the plot.
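A minimal sketch of the data and mapping steps, using the gapminder data and the same variables that appear in this section's plots. The object names p and p_points are just illustrative.

```r
library(ggplot2)
library(gapminder)

# Data plus aesthetic mappings: which variables go on which axes
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))

# Printing p now would draw axes but no data, because no geom has been added.
# Adding a geom says how the mapped variables should be drawn:
p_points <- p + geom_point()
```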
So far the plot has data and mappings but no geom, so it draws as an empty plot.
A scatterplot of Life Expectancy vs GDP
Life Expectancy vs GDP, using a smoother.
Every geom is a function. Functions take arguments.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::label_dollar()) +
  labs(x = "GDP Per Capita",
       y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
Mapping vs Setting your plot’s aesthetics
Some aesthetics, such as size and alpha, are set to fixed values. Meanwhile x and y are mapped.
alpha for overplotting
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::label_dollar()) +
  labs(x = "GDP Per Capita",
       y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
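The difference between mapping and setting can be sketched side by side. The color "purple" and the object names are illustrative choices, not taken from the plots above.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))

# Set: every point gets the same fixed color, outside aes(); no legend
p_set <- p + geom_point(color = "purple")

# Map: color varies with a variable in the data, inside aes(); a legend is drawn
p_map <- p + geom_point(mapping = aes(color = continent))
```

A common slip is to write aes(color = "purple"), which maps a constant string as if it were a variable and so gives a legend with one entry rather than purple points.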
Map or Set values per geom
Pay attention to which scales and guides are drawn, and why
mapping = aes(color = continent, fill = continent)
mapping = aes(color = continent)
Remember: Every mapped variable has a scale
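As a sketch of a per-geom mapping: here color is mapped to continent only inside geom_point(), so the points are colored by continent while geom_smooth() still fits a single line to all the data. The choice of smoother method is an assumption for illustration.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))

# The color mapping belongs to the point layer only
p_geom <- p +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "loess") +
  scale_x_log10(labels = scales::label_dollar())
```

Moving aes(color = continent, fill = continent) up into the ggplot() call instead would apply the mapping to every geom, giving one smoother per continent.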
## Save the most recent plot
ggsave(filename = "figures/my_figure.png")
## Use here() for more robust file paths
ggsave(filename = here("figures", "my_figure.png"))
## A plot object
p_out <- p + geom_point(mapping = aes(color = log(pop)))
ggsave(filename = here("figures", "lifexp_vs_gdp_gradient.pdf"),
       plot = p_out)
ggsave(here("figures", "lifexp_vs_gdp_gradient.png"),
       plot = p_out,
       width = 8,
       height = 5)
Set options in any chunk:
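One way to do this, assuming the document is rendered with knitr, is opts_chunk$set(). The particular width and height values below are illustrative assumptions that echo the ggsave() example.

```r
library(knitr)

# Defaults for the chunks that follow; the values are illustrative assumptions
opts_chunk$set(fig.width = 8,
               fig.height = 5)

# Check what is currently set
opts_chunk$get("fig.width")
```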
ggplot implements a grammar of graphics
The grammar is a set of rules for how to produce graphics from data, by mapping data to or representing it by geometric objects (like points and lines) that have aesthetic attributes (like position, color, size, and shape), together with further rules for transforming data if needed, for adjusting scales and their guides, and for projecting results onto some coordinate system.
Like other rules of syntax, the grammar limits what you can validly say but it doesn’t automatically make what you say sensible or meaningful
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
A facet is not a geom; it’s a way of arranging repeated geoms by some additional variable
Facets use R’s “formula” syntax: facet_wrap(~ continent)
Read the ~ as “on” or “by”
You can also use this syntax: facet_wrap(vars(continent))
This is newer, and consistent with other ways of referring to variables within tidyverse functions.
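A short sketch showing the two syntaxes next to each other on the gapminder data. Both should produce the same panels, one per continent; the object names are illustrative.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap)) +
  geom_line(mapping = aes(group = country))

# Formula syntax
p_formula <- p + facet_wrap(~ continent)

# vars() syntax
p_vars <- p + facet_wrap(vars(continent))
```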
p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))
p_out <- p + geom_line(color = "gray70",
                       mapping = aes(group = country)) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_smooth(linewidth = 1.1,
              method = "loess",
              se = FALSE) +
  scale_y_log10(labels = scales::label_dollar()) +
  facet_wrap(~ continent, ncol = 5) +
  labs(x = "Year",
       y = "log GDP per capita",
       title = "GDP per capita on Five Continents",
       caption = "Data: Gapminder")
The midwest dataset
# A tibble: 437 × 28
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA… IL 0.058 173025 2983. 146506 16559 331
# ℹ 427 more rows
# ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>
stat_ functions behind the scenes
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here the default stat_ function for this geom has to make a choice. It is letting us know we might want to override it.
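A minimal sketch of overriding that default, using the midwest data that ships with ggplot2. The choice of the area variable and of 10 bins is an assumption made for illustration.

```r
library(ggplot2)

p <- ggplot(data = midwest,
            mapping = aes(x = area))

# Default: stat_bin() picks 30 bins and prints the message shown above
p_default <- p + geom_histogram()

# Override: choose the number of bins ourselves (binwidth is the alternative)
p_bins <- p + geom_histogram(bins = 10)
```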
We can override the default by choosing bins or binwidth ourselves.
Use the dplyr verb filter() to subset rows of the data by some condition.
Try leaving the position argument out, or changing it to "dodge".
The plotted statistic here is not in our data! It’s computed. Histogram and density geoms have default statistics, but you can ask them to do more. The after_stat functions can do this work for us.
fate sex n percent
1 perished male 1364 62.0
2 perished female 126 5.7
3 survived male 367 16.7
4 survived female 344 15.6
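A summary table like this one already contains the values to draw, so it can be plotted directly. The sketch below rebuilds the four rows shown above by hand (titanic_tbl is a hypothetical name) and draws them two ways: with geom_bar(stat = "identity") and with geom_col(). Using "dodge" for the position is an illustrative choice.

```r
library(ggplot2)
library(tibble)

# The four rows printed above, entered by hand
titanic_tbl <- tribble(
  ~fate,      ~sex,     ~n,   ~percent,
  "perished", "male",   1364, 62.0,
  "perished", "female", 126,  5.7,
  "survived", "male",   367,  16.7,
  "survived", "female", 344,  15.6
)

# stat = "identity": use the y values as given rather than counting rows
p_bar <- ggplot(data = titanic_tbl,
                mapping = aes(x = fate, y = percent, fill = sex)) +
  geom_bar(stat = "identity", position = "dodge")

# geom_col() plots the values as given without needing the stat argument
p_col <- ggplot(data = titanic_tbl,
                mapping = aes(x = fate, y = percent, fill = sex)) +
  geom_col(position = "dodge")
```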
By default, geom_bar() tries to count up data by category. (Really it’s the stat_count() function that does this behind the scenes.) By saying stat = "identity" we explicitly tell it not to do that. This also allows us to use a y mapping. Normally this would be the result of the counting up.
The position argument controls whether bars are stacked ("stack"), placed side-by-side ("dodge"), or taken as-is ("identity").
The theme() function controls the styling of parts of the plot that don’t belong to its “grammatical” structure. That is, parts that do not contribute directly to representing data.
geom_col() assumes stat = "identity" by default. It’s for when you want to directly plot a table of values, rather than create a bar chart by summing over one variable categorized by another.
geom_col() for thresholds
# A tibble: 57 × 5
# Groups: year [57]
year other usa diff hi_lo
<int> <dbl> <dbl> <dbl> <chr>
1 1960 68.6 69.9 1.30 Below
2 1961 69.2 70.4 1.20 Below
3 1962 68.9 70.2 1.30 Below
4 1963 69.1 70 0.900 Below
5 1964 69.5 70.3 0.800 Below
6 1965 69.6 70.3 0.700 Below
7 1966 69.9 70.3 0.400 Below
8 1967 70.1 70.7 0.600 Below
9 1968 70.1 70.4 0.300 Below
10 1969 70.1 70.6 0.5 Below
# ℹ 47 more rows
diff is difference in years with respect to the U.S.
hi_lo is a flag saying whether the OECD is above or below the U.S.
p <- ggplot(data = oecd_sum,
            mapping = aes(x = year,
                          y = diff,
                          fill = hi_lo))
# linewidth replaces size for lines in ggplot2 >= 3.4.0
p_out <- p + geom_col() +
  geom_hline(yintercept = 0, linewidth = 1.2) +
  guides(fill = "none") +
  labs(x = NULL,
       y = "Difference in Years",
       title = "The U.S. Life Expectancy Gap",
       subtitle = "Difference between U.S. and\nOECD average life expectancies, 1960-2015",
       caption = "Data: OECD.")
geom_hline() doesn’t take any data argument. It just draws a horizontal line with a given y-intercept.
x = NULL means “Don’t label the x-axis (not even with the default value, the variable name).”