Data Wrangling

# Data Wrangling

## Data transformation

---

# `dplyr`

---
class: left, hide-count

## `dplyr`

---
class: left, hide-count

## `filter(data, v2 != 0)`

picks cases based on their values

---
class: left, hide-count

### `filter()`

```
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

]

```r
  filter(flights, month == 1, day == 1)

# equivalent to:
#   filter(flights, month==1 & day == 1)
#   flights %>% filter(month == 1, day == 1)
```

]

```
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

]

**Comparison operators:**
`>`, `>=`, `<`, `<=`, `==` (equal to), `!=` (not equal to), `%in%`

**Logical operators:**
`|` (OR), `&` (AND), `!` (NOT)

**Is missing:**
`filter(is.na(variable))`

**Is NOT missing:**
`filter(!is.na(variable))`

]
]

---
class: left, hide-count

## `arrange(data, v1)`

changes the ordering of the rows

---
class: left, hide-count

### `arrange()`

```r
arrange(flights, desc(year), month, day)
```

```r
# flights %>% arrange(desc(year), month, day)
```

---
class: left, hide-count

## `select(data, -v3)`

picks variables based on their names

---
class: left, hide-count

### `select()`

```r
select(flights, year, month, day)
```

```
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows
```

```r
# select(flights, year:day)
```
]

```r
select(flights, -(year:day))
```

```
## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

```r
# select(flights, -year, -month, -day)
# could also name every other column except these, but why
```
]

There are a number of helper functions you can use within `select()`:

* `starts_with("abc")`: matches names that begin with "abc".
* `ends_with("xyz")`: matches names that end with "xyz".
* `contains("ijk")`: matches names that contain "ijk".
* `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in strings.
* `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
* `everything()`: refers to all columns not explictly named
]

]

---

adds (or modifies) variables that are functions of existing variables

---
class: left, hide-count

### `mutate()`

```r
flights %>%
  select(year:day, ends_with("delay"), air_time) %>% 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours
  )
```

```
## # A tibble: 336,776 × 9
##     year month   day dep_delay arr_delay air_time  gain hours gain_per_hour
##    <int> <int> <int>     <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
##  1  2013     1     1         2        11      227    -9 3.78          -2.38
##  2  2013     1     1         4        20      227   -16 3.78          -4.23
##  3  2013     1     1         2        33      160   -31 2.67         -11.6 
##  4  2013     1     1        -1       -18      183    17 3.05           5.57
##  5  2013     1     1        -6       -25      116    19 1.93           9.83
##  6  2013     1     1        -4        12      150   -16 2.5           -6.4 
##  7  2013     1     1        -5        19      158   -24 2.63          -9.11
##  8  2013     1     1        -3       -14       53    11 0.883         12.5 
##  9  2013     1     1        -3        -8      140     5 2.33           2.14
## 10  2013     1     1        -2         8      138   -10 2.3           -4.35
## # … with 336,766 more rows
```

---
class: left, hide-count

## `summarize(data, v1 = mean(v1))`

reduces multiple values down to a single summary

---
class: left, hide-count

### `summarize()`

```r
flights %>% 
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # … with 355 more rows
```

---
class: left, hide-count

## `distinct(data, v1)`

keeps unique rows

---
class: left, hide-count

### `distinct()`

```r
flights %>% 
  select(time_hour, tailnum, arr_delay) %>%
  arrange(tailnum, desc(time_hour))
```

```
## # A tibble: 336,776 × 3
##    time_hour           tailnum arr_delay
##    <dttm>              <chr>       <dbl>
##  1 2013-07-05 12:00:00 D942DN        -11
##  2 2013-03-24 08:00:00 D942DN          2
##  3 2013-03-23 13:00:00 D942DN         44
##  4 2013-02-11 14:00:00 D942DN         91
##  5 2013-12-31 16:00:00 N0EGMQ        122
##  6 2013-12-30 17:00:00 N0EGMQ         27
##  7 2013-12-30 11:00:00 N0EGMQ         32
##  8 2013-12-27 18:00:00 N0EGMQ        -11
##  9 2013-12-27 12:00:00 N0EGMQ         -3
## 10 2013-12-25 07:00:00 N0EGMQ          2
## # … with 336,766 more rows
```

]

```r
flights %>% 
  select(time_hour, tailnum, arr_delay) %>%
  arrange(tailnum, desc(time_hour)) %>%
  distinct(tailnum)
```

```
## # A tibble: 4,044 × 1
##    tailnum
##    <chr>  
##  1 D942DN 
##  2 N0EGMQ 
##  3 N10156 
##  4 N102UW 
##  5 N103US 
##  6 N104UW 
##  7 N10575 
##  8 N105UW 
##  9 N107US 
## 10 N108UW 
## # … with 4,034 more rows
```
]

```r
flights %>% 
  select(time_hour, tailnum, arr_delay) %>%
  arrange(tailnum, desc(time_hour)) %>%
  distinct(tailnum, .keep_all = TRUE)
```

```
## # A tibble: 4,044 × 3
##    time_hour           tailnum arr_delay
##    <dttm>              <chr>       <dbl>
##  1 2013-07-05 12:00:00 D942DN        -11
##  2 2013-12-31 16:00:00 N0EGMQ        122
##  3 2013-12-31 10:00:00 N10156          2
##  4 2013-12-20 15:00:00 N102UW         -4
##  5 2013-12-15 10:00:00 N103US        -20
##  6 2013-12-29 15:00:00 N104UW          2
##  7 2013-12-31 15:00:00 N10575         19
##  8 2013-12-29 10:00:00 N105UW         20
##  9 2013-12-25 15:00:00 N107US         -9
## 10 2013-12-23 15:00:00 N108UW         50
## # … with 4,034 more rows
```
]
]

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)} 
* [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund