class: titleSlide, hide_logo

# Data Wrangling

## Iteration

<br>
<center><img src="data:image/png;base64,#logo.png" width="200px"/></center>

---
class: left, hide_logo

### This might be your experience thus far

<br>
<center><img src="img/typical.png"></center>

---
class: left, hide_logo

### But iteration might be in your future

<br>
<center><img src="img/typical2.png"></center>

---
class: left, hide_logo

### It might even be in your present

Do you ever find yourself copying/pasting blocks of code and changing only 1 or 2 things each time?

---
class: left, hide_logo

### It might even be in your present

Take this example, but imagine you want 3 separate plots, not facets.

.pull-left[
```r
ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram() +
  facet_wrap(~species, nrow = 3) +
  theme_bw() +
  theme(plot.title = element_text(face="bold"))
```
]

.pull-right[
<img src="data:image/png;base64,#wrangling8_deck_files/figure-html/unnamed-chunk-2-1.png" width="100%" />
]

---
class: left, hide_logo

### Copy/paste with few changes

```r
p1 <- penguins %>%
  filter(species == `"Adelie"`) %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among `Adelie` penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))

p2 <- penguins %>%
  filter(species == `"Gentoo"`) %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among `Gentoo` penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))

p3 <- penguins %>%
  filter(species == `"Chinstrap"`) %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among `Chinstrap` penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))
```

---
class: newTopicSub, hide_logo

# If you copy/paste more than twice, you probably need a function

---
class: left, hide_logo

### Here's a function to create these plots

```r
plot_bill_length <- function(penguin_species) {
  penguins %>%
    filter(species == penguin_species) %>%
    ggplot(aes(x = bill_length_mm)) +
      geom_histogram() +
      labs(x = "Bill length (mm)",
           y = NULL,
           title = glue::glue("Distribution of bill length among",
                              penguin_species,
                              "penguins",
                              .sep = " ")) +
      theme_bw() +
      theme(plot.title = element_text(face="bold"))
}
```

---
class: left, hide_logo

### Give in to writing functions

Functions make your code easier to understand and leave fewer opportunities to mess up.

.panelset[
.panel[.panel-name[This]

```r
p1 <- penguins %>%
  filter(species == "Adelie") %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among Adelie penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))

p2 <- penguins %>%
  filter(species == "Gentoo") %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among Gentoo penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))

p3 <- penguins %>%
  filter(species == "Chinstrap") %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among Chinstrap penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))
```
]

.panel[.panel-name[Becomes]

```r
p1 <- plot_bill_length(penguin_species = "Adelie")
p2 <- plot_bill_length(penguin_species = "Gentoo")
p3 <- plot_bill_length(penguin_species = "Chinstrap")
```
]
]

---
class: left, hide_logo

### Let's break this down

```r
plot_bill_length <- function(penguin_species) {
  penguins %>%
    filter(species == penguin_species) %>%
    ggplot(aes(x = bill_length_mm)) +
      geom_histogram() +
      labs(x = "Bill length (mm)",
           y = NULL,
           title = glue::glue("Distribution of bill length among",
                              penguin_species,
                              "penguins",
                              .sep = " ")) +
      theme_bw() +
      theme(plot.title = element_text(face="bold"))
}
```

---
class: left, hide_logo

### Let's break this down

`function()` is where you define the inputs (or arguments)

```r
function_name <- `function(x)` {
  # stuff the function does
}
```

--

`{ }` is where you define what the function does

```r
function_name <- function(x) `{
  # stuff the function does
}`
```

---
class: left, hide_logo

### Example

* Our function is called `plot_bill_length` (short, verb)
* It takes only one argument, `penguin_species`, and it has no default

```r
plot_bill_length <- function(`penguin_species`) {
  # stuff the function does
}

# example with a default value for the argument
plot_bill_length2 <- function(penguin_species = "Gentoo") {
  # stuff the function does
}
```

---
class: left, hide_logo

### Inside `{ }` is familiar `ggplot()` code

```r
plot_bill_length <- function(penguin_species) {
  penguins %>%
    filter(species == penguin_species) %>%
    ggplot(aes(x = bill_length_mm)) +
      geom_histogram() +
      labs(x = "Bill length (mm)",
           y = NULL,
           title = glue::glue("Distribution of bill length among",
                              penguin_species,
                              "penguins",
                              .sep = " ")) +
      theme_bw() +
      theme(plot.title = element_text(face="bold"))
}
```

---
class: left, hide_logo

### The only new pieces are two placeholders

```r
plot_bill_length <- function(penguin_species) {
  penguins %>%
    filter(species == `penguin_species`) %>%
    ggplot(aes(x = bill_length_mm)) +
      geom_histogram() +
      labs(x = "Bill length (mm)",
           y = NULL,
           title = glue::glue("Distribution of bill length among",
                              `penguin_species`,
                              "penguins",
                              .sep = " ")) +
      theme_bw() +
      theme(plot.title = element_text(face="bold"))
}
```

---
class: left, hide_logo

### The only difference is two placeholders

When you run `plot_bill_length(penguin_species = "Gentoo")`, "Gentoo" gets passed to `penguin_species`.
```r
plot_bill_length <- function(penguin_species) {
  penguins %>%
    filter(species == `penguin_species`) %>%
    ggplot(aes(x = bill_length_mm)) +
      geom_histogram() +
      labs(x = "Bill length (mm)",
           y = NULL,
           title = glue::glue("Distribution of bill length among",
                              `penguin_species`,
                              "penguins",
                              .sep = " ")) +
      theme_bw() +
      theme(plot.title = element_text(face="bold"))
}
```

---
class: left, hide_logo

### This is what R understands

When you run `plot_bill_length(penguin_species = "Gentoo")`, "Gentoo" gets passed to `penguin_species`.

```r
penguins %>%
  filter(species == `"Gentoo"`) %>%
  ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    labs(x = "Bill length (mm)",
         y = NULL,
         title = "Distribution of bill length among `Gentoo` penguins") +
    theme_bw() +
    theme(plot.title = element_text(face="bold"))
```

---
class: left, hide_logo

### Stop copying/pasting, write functions instead

.pull-left[
**This works**

<center><img src="img/ok.png"></center>
]

.pull-right[
**But this is better**

<center><img src="img/better.png"></center>
]

---
class: newTopicSub, hide_logo

# But what if we wanted to make not 3, but 33, 333, or 3333 plots?

---
class: left, hide_logo

### For loops + functions

* for `(each item in the sequence)`
* `{do something}`

```r
for (`s` in c("Adelie", "Gentoo", "Chinstrap")) {
  print(plot_bill_length(penguin_species = `s`))
  # alternatively ggsave(filename = glue::glue(s, ".png"))
}
```

This loop runs the function `plot_bill_length()` three times, once for each species in the sequence.
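
---
class: left, hide_logo

### Keeping the plots from a loop

The loop above prints each plot and then discards it. Here is a minimal sketch, assuming `penguins` comes from the {palmerpenguins} package and reusing `plot_bill_length()` from the earlier slides, that stores the plots in a named list instead. The object names `species` and `plots` are illustrative and not part of the original deck.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins) # assumption: source of the `penguins` data

# same function as on the earlier slides
plot_bill_length <- function(penguin_species) {
  penguins %>%
    filter(species == penguin_species) %>%
    ggplot(aes(x = bill_length_mm)) +
      geom_histogram() +
      labs(x = "Bill length (mm)",
           y = NULL,
           title = glue::glue("Distribution of bill length among",
                              penguin_species,
                              "penguins",
                              .sep = " ")) +
      theme_bw() +
      theme(plot.title = element_text(face="bold"))
}

species <- c("Adelie", "Gentoo", "Chinstrap")

# create the output object first, with one named slot per species
plots <- vector("list", length(species))
names(plots) <- species

# fill each slot by name
for (s in species) {
  plots[[s]] <- plot_bill_length(penguin_species = s)
}
```

Afterwards `plots[["Gentoo"]]` prints a single plot on demand, so nothing has to be re-run to look at it again.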
---
class: left, hide_logo

### Here's a basic for loop setup

Compute the mean of every column in mtcars

```r
output <- vector("double", ncol(mtcars))
names(output) <- names(mtcars)

for (i in names(mtcars)) {
  output[i] <- mean(mtcars[[i]])
}

output
```

```
##        mpg        cyl       disp         hp       drat         wt       qsec
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
##         vs         am       gear       carb
##   0.437500   0.406250   3.687500   2.812500
```

---
class: left, hide_logo

### Here's another example

Start by creating an object to store the output of the loop

```r
output <- vector("double", ncol(mtcars))
output
```

```
##  [1] 0 0 0 0 0 0 0 0 0 0 0
```

---
class: left, hide_logo

### Give each element a name

```r
output <- vector("double", ncol(mtcars))
names(output) <- names(mtcars)
output
```

```
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
##    0    0    0    0    0    0    0    0    0    0    0
```

---
class: left, hide_logo

### Define the sequence

```r
for (`i` in `names(mtcars)`) {

}
```

--

Each time the loop loops, `i` takes a different value:

```r
names(mtcars)
```

```
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
```

---
class: left, hide_logo

### Define what happens in the loop

Calculate the mean, store it in the object `output`

```r
for (i in names(mtcars)) {
  `output[i] <- mean(mtcars[[i]])`
}
```

--

On the first iteration, `i` takes the value `"mpg"`, the first element of the sequence.
```r
output[`"mpg"`] <- mean(mtcars[[`"mpg"`]])
```

---
class: left, hide_logo

### All together now

```r
output <- vector("double", ncol(mtcars))
names(output) <- names(mtcars)

for (i in names(mtcars)) {
  output[i] <- mean(mtcars[[i]])
}

output
```

```
##        mpg        cyl       disp         hp       drat         wt       qsec
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
##         vs         am       gear       carb
##   0.437500   0.406250   3.687500   2.812500
```

---
class: left, hide_logo

### Evergreen sidenote: There are usually many pathways from A to B

A loop is not strictly required in this case

```r
mtcars %>%
  summarise_all(mean)
```

```
##        mpg    cyl     disp       hp     drat      wt     qsec     vs      am
## 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
##     gear   carb
## 1 3.6875 2.8125
```

---
class: newTopicSub, hide_logo

# Functional programming can often be a better choice than for loops

---
class: left, hide_logo

### Functionals

A functional is a function that takes a function as an input and returns a vector as output.

---
class: left, hide_logo

## `{purrr}`

<center><img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/2d0701b616efa7435cd5a94e703baa595a4f9ed0/d41b9/css/images/hex/purrr.png" width="200px"/></center>

---
class: left, hide_logo

### Here's an example

We pass our `plot_bill_length()` function to the `purrr::map()` function. It iterates over the vector `c("Adelie", "Gentoo", "Chinstrap")` to produce three plots.

```r
map(c("Adelie", "Gentoo", "Chinstrap"), ~ plot_bill_length(.))
```

`~` is pronounced "twiddle"

---
class: left, hide_logo

### Here's what it looks like graphically

<center><img src="data:image/png;base64,#img/map.png"/></center>

* Under the hood, `map()` works like a for loop that fills in a list, but the real function is written in C for better performance.
* `map()` is similar to the base function `lapply()`

---
class: left, hide_logo

### The map functions

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:

* `map()` makes a list
* `map_lgl()` makes a logical vector
* `map_int()` makes an integer vector
* `map_dbl()` makes a double vector
* `map_chr()` makes a character vector

---
class: left, hide_logo

### Anonymous functions

Instead of using `map()` with an existing function, you can create an inline anonymous function

```r
map_dbl(mtcars, ~ length(unique(.x)))
```

```
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
##   25    3   27   22   22   29   30    2    2    3    6
```

---
class: left, hide_logo

### Passing arguments with `...`

It's often convenient to pass along additional arguments to the function that you’re calling. For example, you might want to pass `na.rm = TRUE` along to `mean()`.
Because the map functions pass `...` along to the function they call, you can supply the extra argument directly after the function name:

```r
x <- list(1:5, c(1:10, NA))
x
```

```
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10 NA
```

```r
map_dbl(x, mean, na.rm = TRUE)
```

```
## [1] 3.0 5.5
```

---
class: left, hide_logo

### A typical use case

Fitting a model to subgroups and extracting model coefficients

```r
mtcars %>%
  group_by(cyl) %>%
  nest()
```

```
## # A tibble: 3 × 2
## # Groups:   cyl [3]
##     cyl data
##   <dbl> <list>
## 1     6 <tibble [7 × 10]>
## 2     4 <tibble [11 × 10]>
## 3     8 <tibble [14 × 10]>
```

---
class: left, hide_logo

### Each `<tibble>` is a dataframe

.pull-left[
<img src="img/nested1.png">
]

.pull-right[
<img src="img/nested2.png">
]

---
class: left, hide_logo

### Next map over each dataframe and fit a model

```r
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(results = map(.x = data,
                       ~ lm(mpg ~ wt, data = .x) %>%
                         tidy()))
```

```
## # A tibble: 3 × 3
## # Groups:   cyl [3]
##     cyl data               results
##   <dbl> <list>             <list>
## 1     6 <tibble [7 × 10]>  <tibble [2 × 5]>
## 2     4 <tibble [11 × 10]> <tibble [2 × 5]>
## 3     8 <tibble [14 × 10]> <tibble [2 × 5]>
```

---
class: left, hide_logo

### More nesting

.pull-left[
<img src="img/nested3.png">
]

.pull-right[
<img src="img/nested4.png">
]

---
class: left, hide_logo

### Unnest the results

```r
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(results = map(.x = data,
                       ~ lm(mpg ~ wt, data = .x) %>%
                         tidy())) %>%
  unnest(cols = results)
```

```
## # A tibble: 6 × 7
## # Groups:   cyl [3]
##     cyl data               term        estimate std.error statistic p.value
##   <dbl> <list>             <chr>          <dbl>     <dbl>     <dbl> <dbl>
## 1     6 <tibble [7 × 10]>  (Intercept)    28.4      4.18       6.79 0.00105
## 2     6 <tibble [7 × 10]>  wt             -2.78     1.33      -2.08 0.0918
## 3     4 <tibble [11 × 10]> (Intercept)    39.6      4.35       9.10 0.00000777
## 4     4 <tibble [11 × 10]> wt             -5.65     1.85      -3.05 0.0137
## 5     8 <tibble [14 × 10]> (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 <tibble [14 × 10]> wt             -2.19     0.739     -2.97 0.0118
```

---
class: left, hide_logo

### Filter to just the coefficients we want
```r
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(results = map(.x = data,
                       ~ lm(mpg ~ wt, data = .x) %>%
                         tidy())) %>%
  unnest(cols = results) %>%
  filter(term=="wt")
```

```
## # A tibble: 3 × 7
## # Groups:   cyl [3]
##     cyl data               term  estimate std.error statistic p.value
##   <dbl> <list>             <chr>    <dbl>     <dbl>     <dbl>   <dbl>
## 1     6 <tibble [7 × 10]>  wt       -2.78     1.33      -2.08  0.0918
## 2     4 <tibble [11 × 10]> wt       -5.65     1.85      -3.05  0.0137
## 3     8 <tibble [14 × 10]> wt       -2.19     0.739     -2.97  0.0118
```

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution-ShareAlike [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)}
* [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund
* [Advanced R](https://adv-r.hadley.nz/index.html), by Wickham
* [Allison Horst's](https://github.com/allisonhorst/stats-illustrations) illustrations