class: titleSlide, hide_logo # Data Wrangling ## Data transformation <br> <center><img src="data:image/png;base64,#logo.png" width="200px"/></center> --- class: newTopicSub, hide_logo # `dplyr` <center> <img src="img/dplyrhex.png" width="350px" style="border: 0;"> </center> --- class: left, hide-count ## `dplyr` <center> <img src="img/dplyr.png" width="900px" style="border: 0;"> </center> --- class: left, hide-count ## `filter(data, v2 != 0)` <center> <img src="img/filter.png" style="border: 0;"> </center> picks cases based on their values --- class: left, hide-count ### `filter()` .panelset[ .panel[.panel-name[Data] ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] .panel[.panel-name[Code] ```r filter(flights, month == 1, day == 1) # equivalent to: # filter(flights, month==1 & day == 1) # flights %>% filter(month == 1, day == 1) ``` ] .panel[.panel-name[Result] ``` ## # A tibble: 842 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] .panel[.panel-name[Notes] **Comparison operators:** `>`, `>=`, `<`, `<=`, `==` (equal to), `!=` (not equal to), `%in%` **Logical operators:** `|` (OR), `&` (AND), `!` (NOT) **Is missing:** `filter(is.na(variable))` **Is NOT missing:** `filter(!is.na(variable))` ] ] --- class: left, hide-count ## `arrange(data, v1)` <center> <img src="img/arrange.png" style="border: 0;"> </center> changes the ordering of the rows --- class: left, hide-count ### `arrange()` ```r arrange(flights, desc(year), month, day) ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ```r # flights %>% arrange(desc(year), month, day) ``` --- class: left, hide-count ## `select(data, -v3)` <center> <img src="img/select.png" style="border: 0;"> </center> picks variables based on their names --- class: left, hide-count ### `select()` .panelset[ .panel[.panel-name[Include] ```r select(flights, year, month, day) ``` ``` ## # A tibble: 336,776 × 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # … with 336,766 more rows ``` ```r # select(flights, year:day) ``` ] .panel[.panel-name[Exclude] ```r select(flights, -(year:day)) ``` ``` ## # A tibble: 336,776 × 16 ## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ## <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 517 515 2 830 819 11 UA ## 2 533 529 4 850 830 20 UA ## 3 542 540 2 923 850 33 AA ## 4 544 545 -1 1004 1022 -18 B6 ## 5 554 600 -6 812 837 -25 DL ## 6 554 558 -4 740 728 12 UA ## 7 555 600 -5 913 854 19 B6 ## 8 557 600 -3 709 723 -14 EV ## 9 557 600 -3 838 846 -8 B6 ## 10 558 600 -2 753 745 8 AA ## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` ```r # select(flights, -year, -month, -day) # could also name every other column except these, but why ``` ] .panel[.panel-name[Helpers] There are a number of helper functions you can use within `select()`: * `starts_with("abc")`: matches names that begin with "abc". * `ends_with("xyz")`: matches names that end with "xyz". * `contains("ijk")`: matches names that contain "ijk". * `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in strings. * `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`. * `everything()`: refers to all columns not explictly named ] ] --- class: left, hide-count ## `mutate(data, tot = v1 + v2 + v3)` <center> <img src="img/mutate.png" style="border: 0;"> </center> adds (or modifies) variables that are functions of existing variables --- class: left, hide-count ### `mutate()` ```r flights %>% select(year:day, ends_with("delay"), air_time) %>% mutate( gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` ``` ## # A tibble: 336,776 × 9 ## year month day dep_delay arr_delay air_time gain hours gain_per_hour ## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2013 1 1 2 11 227 -9 3.78 -2.38 ## 2 2013 1 1 4 20 227 -16 3.78 -4.23 ## 3 2013 1 1 2 33 160 -31 2.67 -11.6 ## 4 2013 1 1 -1 -18 183 17 3.05 5.57 ## 5 2013 1 1 -6 -25 116 19 1.93 9.83 ## 6 2013 1 1 -4 12 150 -16 2.5 -6.4 ## 7 2013 1 1 -5 19 158 -24 2.63 -9.11 ## 8 2013 1 1 -3 -14 53 11 0.883 12.5 ## 9 2013 1 1 -3 -8 140 5 2.33 2.14 ## 10 2013 1 1 -2 8 138 -10 2.3 -4.35 ## # … with 336,766 more rows ``` --- class: left, hide-count ## `summarize(data, v1 = mean(v1))` <center> <img src="img/summarize.png" style="border: 0;"> </center> reduces multiple values down to a single summary --- class: left, hide-count ### `summarize()` ```r flights %>% group_by(year, month, day) %>% summarise(delay = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 365 × 4 ## # Groups: year, month [12] ## year month day delay ## <int> <int> <int> <dbl> ## 1 2013 1 1 11.5 ## 2 2013 1 2 13.9 ## 3 2013 1 3 11.0 ## 4 2013 1 4 8.95 ## 5 2013 1 5 5.73 ## 6 2013 1 6 7.15 ## 7 2013 1 7 5.42 ## 8 2013 1 8 2.55 ## 9 2013 1 9 2.28 ## 10 2013 1 10 2.84 ## # … with 355 more rows ``` --- class: left, hide-count ## `distinct(data, v1)` <center> <img src="img/distinct.png" style="border: 0;"> </center> keeps unique rows --- class: left, hide-count ### `distinct()` .panelset[ .panel[.panel-name[Inspect data] ```r flights %>% select(time_hour, tailnum, arr_delay) %>% arrange(tailnum, desc(time_hour)) ``` ``` ## # A tibble: 336,776 × 3 ## time_hour tailnum arr_delay ## <dttm> <chr> <dbl> ## 1 2013-07-05 12:00:00 D942DN -11 ## 2 2013-03-24 08:00:00 D942DN 2 ## 3 2013-03-23 13:00:00 D942DN 44 ## 4 2013-02-11 14:00:00 D942DN 91 ## 5 2013-12-31 16:00:00 N0EGMQ 122 ## 6 2013-12-30 17:00:00 N0EGMQ 27 ## 7 2013-12-30 11:00:00 N0EGMQ 32 ## 8 2013-12-27 18:00:00 N0EGMQ -11 ## 9 2013-12-27 12:00:00 N0EGMQ -3 ## 10 2013-12-25 07:00:00 N0EGMQ 2 ## # … with 336,766 more rows ``` ] .panel[.panel-name[Default] ```r flights %>% select(time_hour, tailnum, arr_delay) %>% arrange(tailnum, desc(time_hour)) %>% distinct(tailnum) ``` ``` ## # A tibble: 4,044 × 1 ## tailnum ## <chr> ## 1 D942DN ## 2 N0EGMQ ## 3 N10156 ## 4 N102UW ## 5 N103US ## 6 N104UW ## 7 N10575 ## 8 N105UW ## 9 N107US ## 10 N108UW ## # … with 4,034 more rows ``` ] .panel[.panel-name[`.keep_all=TRUE`] ```r flights %>% select(time_hour, tailnum, arr_delay) %>% arrange(tailnum, desc(time_hour)) %>% distinct(tailnum, .keep_all = TRUE) ``` ``` ## # A tibble: 4,044 × 3 ## time_hour tailnum arr_delay ## <dttm> <chr> <dbl> ## 1 2013-07-05 12:00:00 D942DN -11 ## 2 2013-12-31 16:00:00 N0EGMQ 122 ## 3 2013-12-31 10:00:00 N10156 2 ## 4 2013-12-20 15:00:00 N102UW -4 ## 5 2013-12-15 10:00:00 N103US -20 ## 6 2013-12-29 15:00:00 N104UW 2 ## 7 2013-12-31 15:00:00 N10575 19 ## 8 2013-12-29 10:00:00 N105UW 20 ## 9 2013-12-25 15:00:00 N107US -9 ## 10 2013-12-23 15:00:00 N108UW 50 ## # … with 4,034 more rows ``` ] ] --- class: left, hide-count # Credits Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) * {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)} * [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund