Data Wrangling

class: titleSlide, hide_logo

# Data Wrangling

## Tidy data

---
class: left, hide-count

## Tidy data

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

<center><img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png"/></center>
---
class: left, hide-count

### Why tidy?

1. Consistent storage format makes it easier to work with tools that understand this format (e.g., the tidyverse).
2. Variables in columns makes it easy to do many operations in R that are vectorized (i.e., operate on vectors of values)

---

class: left, hide-count

### Application

* Return to the [DPS COVID Dashboard](https://www.dpsnc.net/Page/5339) I introduced last session. This [link](https://docs.google.com/spreadsheets/d/1kQ35JKpk3vNaM7ZfFK2DFb1B9kQIQ4IODr7lOM394nQ/edit?usp=sharing) will jump you straight to the Google Sheet with the underlying data. Open the `dps.Rmd` file in `wrangling2`.
* Complete steps 1-4
* For step 4: In Google Sheets use File > Make a Copy to make a copy of the sheet in your Google Drive account. Then create a new sheet in this copy and design a tidy version of this dataset. It's not essential that you capture all of the values, just enough to show your idea.

---
class: left, hide-count

### Reshaping is hard in any language

.pull-left[

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">When I get too full of myself with my book, it is good to have a comforting reminder from <a href="https://twitter.com/Stata?ref_src=twsrc%5Etfw">@Stata</a> that no matter how successful I am, I will still have to type &quot;help reshape&quot; every single time.</p>&mdash; ProfEmilyOster (@ProfEmilyOster) <a href="https://twitter.com/ProfEmilyOster/status/1128384932093599744?ref_src=twsrc%5Etfw">May 14, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

]
.pull-right[
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">successfully used pivot_longer() for the first time and had no errors AMA <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a></p>&mdash; Andrew Heiss (@andrewheiss) <a href="https://twitter.com/andrewheiss/status/1174167010974019584?ref_src=twsrc%5Etfw">September 18, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
]

---
class: left, hide-count

### Watch out for legacy code

When looking for help reshaping data, you will likely find references to `melt()`/`dcast()` (a reference to melting and casting metal, aka naming stuff is hard) and `gather()`/`spread()`. Go with `pivot_longer()`/`pivot/wider()`.

---
class: left, hide-count

### `pivot_longer()`

When column names contain values, you need to pivot longer:

---
class: left, hide-count

### String data in column names

How many variables do you see?

```r
relig_income
```

```
## # A tibble: 18 × 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Agnostic      27        34        60        81        76       137        122
##  2 Atheist       12        27        37        52        35        70         73
##  3 Buddhist      27        21        30        34        33        58         62
##  4 Catholic     418       617       732       670       638      1116        949
##  5 Don’t k…      15        14        15        11        10        35         21
##  6 Evangel…     575       869      1064       982       881      1486        949
##  7 Hindu          1         9         7         9        11        34         47
##  8 Histori…     228       244       236       238       197       223        131
##  9 Jehovah…      20        27        24        24        21        30         15
## 10 Jewish        19        19        25        25        30        95         69
## 11 Mainlin…     289       495       619       655       651      1107        939
## 12 Mormon        29        40        48        51        56       112         85
## 13 Muslim         6         7         9        10         9        23         16
## 14 Orthodox      13        17        23        32        32        47         38
## 15 Other C…       9         7        11        13        13        14         18
## 16 Other F…      20        33        40        46        49        63         46
## 17 Other W…       5         2         3         4         2         7          3
## 18 Unaffil…     217       299       374       365       341       528        407
## # … with 3 more variables: $100-150k <dbl>, >150k <dbl>,
## #   Don't know/refused <dbl>
```

---
class: left, hide-count

### Make a plan to pivot longer

<img src="data:image/png;base64,#img/pivot_religion.png" width="100%" />
---
class: left, hide-count

### Tidy with `pivot_longer()`

.panelset[
.panel[.panel-name[Code]

`income` and `count` are new variables, so surround them with quotes. Could have picked any names.

```r
relig_income %>% 
  pivot_longer(!religion,            # which columns to pivot?
               names_to = "income",  # new variable name for income brackets
               values_to = "count")  # new variable name for counts
```
]

.panel[.panel-name[Result]

```
## # A tibble: 180 × 3
##    religion income             count
##    <chr>    <chr>              <dbl>
##  1 Agnostic <$10k                 27
##  2 Agnostic $10-20k               34
##  3 Agnostic $20-30k               60
##  4 Agnostic $30-40k               81
##  5 Agnostic $40-50k               76
##  6 Agnostic $50-75k              137
##  7 Agnostic $75-100k             122
##  8 Agnostic $100-150k            109
##  9 Agnostic >150k                 84
## 10 Agnostic Don't know/refused    96
## # … with 170 more rows
```
]
]
---
class: left, hide-count

### Numeric data in column names

```r
billboard
```

```
## # A tibble: 317 × 79
##    artist   track   date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>    <chr>   <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac    Baby D… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her  The Ha… 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors… Krypto… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors… Loser   2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz Wobble… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0     Give M… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens  Dancin… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah  I Don'… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah  Try Ag… 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, … Open M… 2000-08-26      76    76    74    69    68    67    61    58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …
```

---
class: left, hide-count

### Tidy with `pivot_longer()`

.panelset[
.panel[.panel-name[Think First]

```r
billboard %>% 
  pivot_longer(
    cols = ___, 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )
```

]
.panel[.panel-name[Code]

```r
billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )
```

]

.panel[.panel-name[Results]

```
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # … with 5,297 more rows
```

]

---
class: left, hide-count

### `week` is a string, but we want a dbl

.panelset[
.panel[.panel-name[Code]

```r
billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  ) %>%
  mutate(week = stringr::str_remove(week, "wk"),
         week = as.numeric(week))
```

]
.panel[.panel-name[Results]

```
## # A tibble: 5,307 × 5
##    artist  track                   date.entered  week  rank
##    <chr>   <chr>                   <date>       <dbl> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92
## # … with 5,297 more rows
```

]
]
---
class: left, hide-count

### When we get to strings, we'll tackle more complex patterns

```r
who
```

```
## # A tibble: 7,240 × 60
##    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>    <chr> <chr> <int>       <int>        <int>        <int>        <int>
##  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## #   new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
## #   new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## #   new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## #   new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## #   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
## #   new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>, …
```

---
class: left, hide-count

### `pivot_wider()`

When observations are scattered across rows, use `pivot_wider()`:

---
class: left, hide-count

### Long(er) formats are often good for plotting and certain models

```
## # A tibble: 80 × 5
##      pid round group     high_risk score
##    <dbl> <chr> <chr>     <chr>     <int>
##  1     1 r_1   treatment Yes          10
##  2     2 r_1   treatment Yes           9
##  3     3 r_1   treatment Yes           8
##  4     4 r_1   treatment Yes           9
##  5     5 r_1   treatment Yes          10
##  6     6 r_1   treatment No            7
##  7     7 r_1   treatment No            9
##  8     8 r_1   treatment No            7
##  9     9 r_1   treatment No            8
## 10    10 r_1   treatment No            9
## # … with 70 more rows
```

---
class: left, hide-count

### Long(er) formats are often good for plotting and certain models

```r
long %>%
  ggplot(aes(x=round, y=score, fill=group)) +
    geom_col(position = "dodge") + facet_wrap(~high_risk) + theme_bw()
```

<img src="data:image/png;base64,#wrangling3_files/figure-html/unnamed-chunk-14-1.png" width="100%" />
---
class: left, hide-count

### Sometimes you might want to pivot wider

.panelset[
.panel[.panel-name[Think First]

```r
long %>%
  pivot_wider(names_from = ____, 
              values_from = _____) 
```

]
.panel[.panel-name[Code]

```r
long %>%
  pivot_wider(names_from = round, 
              values_from = score) 
```

]
.panel[.panel-name[Results]

```
## # A tibble: 20 × 7
##      pid group     high_risk   r_1   r_2   r_3   r_4
##    <dbl> <chr>     <chr>     <int> <int> <int> <int>
##  1     1 treatment Yes          10     8     8     6
##  2     2 treatment Yes           9     6    10     7
##  3     3 treatment Yes           8     8     8     5
##  4     4 treatment Yes           9     8     9     7
##  5     5 treatment Yes          10     8     8     5
##  6     6 treatment No            7     5     9     8
##  7     7 treatment No            9    10    10     6
##  8     8 treatment No            7     7     9     9
##  9     9 treatment No            8     6     9     7
## 10    10 treatment No            9     9     9     8
## 11    11 control   Yes           6     0     6     3
## 12    12 control   Yes           4     2     2     4
## 13    13 control   Yes           5     6     5     0
## 14    14 control   Yes           2     5     4     6
## 15    15 control   Yes           6     3     2     4
## 16    16 control   No            0     5     0     0
## 17    17 control   No            3     5     5     2
## 18    18 control   No            4     0     2     5
## 19    19 control   No            1     5     6     0
## 20    20 control   No            2     2     3     4
```

]
]

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)} 
* [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund