class: titleSlide, hide_logo # Data Wrangling ## Tidy data <br> <center><img src="data:image/png;base64,#logo.png" width="200px"/></center> --- class: left, hide-count ## Tidy data 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell. <center><img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png"/></center> --- class: left, hide-count ### Why tidy? 1. Consistent storage format makes it easier to work with tools that understand this format (e.g., the tidyverse). 2. Variables in columns makes it easy to do many operations in R that are vectorized (i.e., operate on vectors of values) --- class: left, hide-count ### Application * Return to the [DPS COVID Dashboard](https://www.dpsnc.net/Page/5339) I introduced last session. This [link](https://docs.google.com/spreadsheets/d/1kQ35JKpk3vNaM7ZfFK2DFb1B9kQIQ4IODr7lOM394nQ/edit?usp=sharing) will jump you straight to the Google Sheet with the underlying data. Open the `dps.Rmd` file in `wrangling2`. * Complete steps 1-4 * For step 4: In Google Sheets use File > Make a Copy to make a copy of the sheet in your Google Drive account. Then create a new sheet in this copy and design a tidy version of this dataset. It's not essential that you capture all of the values, just enough to show your idea. --- class: left, hide-count ### Reshaping is hard in any language .pull-left[ <blockquote class="twitter-tweet"><p lang="en" dir="ltr">When I get too full of myself with my book, it is good to have a comforting reminder from <a href="https://twitter.com/Stata?ref_src=twsrc%5Etfw">@Stata</a> that no matter how successful I am, I will still have to type "help reshape" every single time.</p>— ProfEmilyOster (@ProfEmilyOster) <a href="https://twitter.com/ProfEmilyOster/status/1128384932093599744?ref_src=twsrc%5Etfw">May 14, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] .pull-right[ <blockquote class="twitter-tweet"><p lang="en" dir="ltr">successfully used pivot_longer() for the first time and had no errors AMA <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a></p>— Andrew Heiss (@andrewheiss) <a href="https://twitter.com/andrewheiss/status/1174167010974019584?ref_src=twsrc%5Etfw">September 18, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] --- class: left, hide-count ### Watch out for legacy code When looking for help reshaping data, you will likely find references to `melt()`/`dcast()` (a reference to melting and casting metal, aka naming stuff is hard) and `gather()`/`spread()`. Go with `pivot_longer()`/`pivot/wider()`. --- class: left, hide-count ### `pivot_longer()` When column names contain values, you need to pivot longer: <center><img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/3aea19108d39606bbe49981acda07696c0c7fcd8/2de65/images/tidy-9.png"/></center> --- class: left, hide-count ### String data in column names How many variables do you see? ```r relig_income ``` ``` ## # A tibble: 18 × 11 ## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Agnostic 27 34 60 81 76 137 122 ## 2 Atheist 12 27 37 52 35 70 73 ## 3 Buddhist 27 21 30 34 33 58 62 ## 4 Catholic 418 617 732 670 638 1116 949 ## 5 Don’t k… 15 14 15 11 10 35 21 ## 6 Evangel… 575 869 1064 982 881 1486 949 ## 7 Hindu 1 9 7 9 11 34 47 ## 8 Histori… 228 244 236 238 197 223 131 ## 9 Jehovah… 20 27 24 24 21 30 15 ## 10 Jewish 19 19 25 25 30 95 69 ## 11 Mainlin… 289 495 619 655 651 1107 939 ## 12 Mormon 29 40 48 51 56 112 85 ## 13 Muslim 6 7 9 10 9 23 16 ## 14 Orthodox 13 17 23 32 32 47 38 ## 15 Other C… 9 7 11 13 13 14 18 ## 16 Other F… 20 33 40 46 49 63 46 ## 17 Other W… 5 2 3 4 2 7 3 ## 18 Unaffil… 217 299 374 365 341 528 407 ## # … with 3 more variables: $100-150k <dbl>, >150k <dbl>, ## # Don't know/refused <dbl> ``` --- class: left, hide-count ### Make a plan to pivot longer <img src="data:image/png;base64,#img/pivot_religion.png" width="100%" /> --- class: left, hide-count ### Tidy with `pivot_longer()` .panelset[ .panel[.panel-name[Code] `income` and `count` are new variables, so surround them with quotes. Could have picked any names. ```r relig_income %>% pivot_longer(!religion, # which columns to pivot? names_to = "income", # new variable name for income brackets values_to = "count") # new variable name for counts ``` ] .panel[.panel-name[Result] ``` ## # A tibble: 180 × 3 ## religion income count ## <chr> <chr> <dbl> ## 1 Agnostic <$10k 27 ## 2 Agnostic $10-20k 34 ## 3 Agnostic $20-30k 60 ## 4 Agnostic $30-40k 81 ## 5 Agnostic $40-50k 76 ## 6 Agnostic $50-75k 137 ## 7 Agnostic $75-100k 122 ## 8 Agnostic $100-150k 109 ## 9 Agnostic >150k 84 ## 10 Agnostic Don't know/refused 96 ## # … with 170 more rows ``` ] ] --- class: left, hide-count ### Numeric data in column names ```r billboard ``` ``` ## # A tibble: 317 × 79 ## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 ## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2 Pac Baby D… 2000-02-26 87 82 72 77 87 94 99 NA ## 2 2Ge+her The Ha… 2000-09-02 91 87 92 NA NA NA NA NA ## 3 3 Doors… Krypto… 2000-04-08 81 70 68 67 66 57 54 53 ## 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59 ## 5 504 Boyz Wobble… 2000-04-15 57 34 25 17 17 31 36 49 ## 6 98^0 Give M… 2000-08-19 51 39 34 26 26 19 2 2 ## 7 A*Teens Dancin… 2000-07-08 97 97 96 95 100 NA NA NA ## 8 Aaliyah I Don'… 2000-01-29 84 62 51 41 38 35 35 38 ## 9 Aaliyah Try Ag… 2000-03-18 59 53 38 28 21 18 16 14 ## 10 Adams, … Open M… 2000-08-26 76 76 74 69 68 67 61 58 ## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>, ## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, ## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, ## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, ## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, ## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, ## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, … ``` --- class: left, hide-count ### Tidy with `pivot_longer()` .panelset[ .panel[.panel-name[Think First] ```r billboard %>% pivot_longer( cols = ___, names_to = "week", values_to = "rank", values_drop_na = TRUE ) ``` ] .panel[.panel-name[Code] ```r billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE ) ``` ] .panel[.panel-name[Results] ``` ## # A tibble: 5,307 × 5 ## artist track date.entered week rank ## <chr> <chr> <date> <chr> <dbl> ## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 ## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82 ## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72 ## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77 ## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87 ## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94 ## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99 ## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 ## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87 ## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92 ## # … with 5,297 more rows ``` ] ] --- class: left, hide-count ### `week` is a string, but we want a dbl .panelset[ .panel[.panel-name[Code] ```r billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE ) %>% mutate(week = stringr::str_remove(week, "wk"), week = as.numeric(week)) ``` ] .panel[.panel-name[Results] ``` ## # A tibble: 5,307 × 5 ## artist track date.entered week rank ## <chr> <chr> <date> <dbl> <dbl> ## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87 ## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82 ## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72 ## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77 ## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87 ## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94 ## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99 ## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91 ## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87 ## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92 ## # … with 5,297 more rows ``` ] ] --- class: left, hide-count ### When we get to strings, we'll tackle more complex patterns ```r who ``` ``` ## # A tibble: 7,240 × 60 ## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 ## <chr> <chr> <chr> <int> <int> <int> <int> <int> ## 1 Afghani… AF AFG 1980 NA NA NA NA ## 2 Afghani… AF AFG 1981 NA NA NA NA ## 3 Afghani… AF AFG 1982 NA NA NA NA ## 4 Afghani… AF AFG 1983 NA NA NA NA ## 5 Afghani… AF AFG 1984 NA NA NA NA ## 6 Afghani… AF AFG 1985 NA NA NA NA ## 7 Afghani… AF AFG 1986 NA NA NA NA ## 8 Afghani… AF AFG 1987 NA NA NA NA ## 9 Afghani… AF AFG 1988 NA NA NA NA ## 10 Afghani… AF AFG 1989 NA NA NA NA ## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>, ## # new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>, ## # new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>, ## # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>, ## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>, ## # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, ## # new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>, … ``` --- class: left, hide-count ### `pivot_wider()` When observations are scattered across rows, use `pivot_wider()`: <center><img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/8350f0dda414629b9d6c354f87acf5c5f722be43/bcb84/images/tidy-8.png"/></center> --- class: left, hide-count ### Long(er) formats are often good for plotting and certain models ``` ## # A tibble: 80 × 5 ## pid round group high_risk score ## <dbl> <chr> <chr> <chr> <int> ## 1 1 r_1 treatment Yes 10 ## 2 2 r_1 treatment Yes 9 ## 3 3 r_1 treatment Yes 8 ## 4 4 r_1 treatment Yes 9 ## 5 5 r_1 treatment Yes 10 ## 6 6 r_1 treatment No 7 ## 7 7 r_1 treatment No 9 ## 8 8 r_1 treatment No 7 ## 9 9 r_1 treatment No 8 ## 10 10 r_1 treatment No 9 ## # … with 70 more rows ``` --- class: left, hide-count ### Long(er) formats are often good for plotting and certain models ```r long %>% ggplot(aes(x=round, y=score, fill=group)) + geom_col(position = "dodge") + facet_wrap(~high_risk) + theme_bw() ``` <img src="data:image/png;base64,#wrangling3_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> --- class: left, hide-count ### Sometimes you might want to pivot wider .panelset[ .panel[.panel-name[Think First] ```r long %>% pivot_wider(names_from = ____, values_from = _____) ``` ] .panel[.panel-name[Code] ```r long %>% pivot_wider(names_from = round, values_from = score) ``` ] .panel[.panel-name[Results] ``` ## # A tibble: 20 × 7 ## pid group high_risk r_1 r_2 r_3 r_4 ## <dbl> <chr> <chr> <int> <int> <int> <int> ## 1 1 treatment Yes 10 8 8 6 ## 2 2 treatment Yes 9 6 10 7 ## 3 3 treatment Yes 8 8 8 5 ## 4 4 treatment Yes 9 8 9 7 ## 5 5 treatment Yes 10 8 8 5 ## 6 6 treatment No 7 5 9 8 ## 7 7 treatment No 9 10 10 6 ## 8 8 treatment No 7 7 9 9 ## 9 9 treatment No 8 6 9 7 ## 10 10 treatment No 9 9 9 8 ## 11 11 control Yes 6 0 6 3 ## 12 12 control Yes 4 2 2 4 ## 13 13 control Yes 5 6 5 0 ## 14 14 control Yes 2 5 4 6 ## 15 15 control Yes 6 3 2 4 ## 16 16 control No 0 5 0 0 ## 17 17 control No 3 5 5 2 ## 18 18 control No 4 0 2 5 ## 19 19 control No 1 5 6 0 ## 20 20 control No 2 2 3 4 ``` ] ] --- class: left, hide-count # Credits Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) * {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)} * [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund