class: titleSlide, hide_logo

# Data Wrangling
## Strings

<br>
<center><img src="data:image/png;base64,#logo.png" width="200px"/></center>

---
class: left, hide-count

### Me, every time I work with strings. Every time.

<center><img src="data:image/png;base64,#https://media.giphy.com/media/YpmVBNubONoqs/giphy.gif"/></center>

---
class: newTopicSub, hide_logo

# Why do we care?

---
class: left, hide_logo

### Cleaning data

.panelset[
.panel[.panel-name[Dirty]

```r
library(tidyverse)  # loads tibble, dplyr, and stringr

df <- tibble(country = c("Kenya", "Kennya", "kenya"))
df
```

```
## # A tibble: 3 × 1
##   country
##   <chr>
## 1 Kenya
## 2 Kennya
## 3 kenya
```

]

.panel[.panel-name[Clean]

```r
df %>%
  mutate(country = str_to_lower(country, locale = "en"),  # normalize case first
         country = case_when(
           country == "kennya" ~ "kenya",                 # then fix the misspelling
           TRUE ~ country
         ))
```

```
## # A tibble: 3 × 1
##   country
##   <chr>
## 1 kenya
## 2 kenya
## 3 kenya
```

]
]

---
class: left, hide_logo

### Transforming data

.panelset[
.panel[.panel-name[Original names]

```r
df <- tibble(p1cry = c("y", "n", "y"),
             p2eat = c("y", "y", "y"),
             p3sleep = c("y", "y", "n"))
df
```

```
## # A tibble: 3 × 3
##   p1cry p2eat p3sleep
##   <chr> <chr> <chr>
## 1 y     y     y
## 2 n     y     y
## 3 y     y     n
```

]

.panel[.panel-name[Transformed names 1]

```r
# replace everything before the first digit with "phq"
df %>%
  setNames(str_replace(names(.), ".+?(?=\\d)", "phq"))
```

```
## # A tibble: 3 × 3
##   phq1cry phq2eat phq3sleep
##   <chr>   <chr>   <chr>
## 1 y       y       y
## 2 n       y       y
## 3 y       y       n
```

]

.panel[.panel-name[Transformed names 2]

```r
# then insert an underscore after the digit
df %>%
  setNames(str_replace(names(.), ".+?(?=\\d)", "phq")) %>%
  setNames(str_replace(names(.), "(.*\\d)(.*)", "\\1_\\2"))
```

```
## # A tibble: 3 × 3
##   phq1_cry phq2_eat phq3_sleep
##   <chr>    <chr>    <chr>
## 1 y        y        y
## 2 n        y        y
## 3 y        y        n
```

]
]

---
class: left, hide_logo

### Text mining

<iframe src="https://www.tidytextmining.com/" width="100%" height="90%" data-external="1"></iframe>

---
class: left, hide_logo

### Get help from sites like regex101

<img src="data:image/png;base64,#img/regex.png" width="100%" />

---
class: left, hide-count

### `RegExplain` RStudio Addin

<center><img src="data:image/png;base64,#https://www.garrickadenbuie.com/images/project/regexplain/regexplain-selection.gif"/></center>

---
class: left, hide-count

### `RegExplain` RStudio Addin

```r
devtools::install_github("gadenbuie/regexplain")
```

---
class: left, hide-count

### Things to know about strings

* You can create strings with either single or double quotes. Use `"` unless the string itself contains double quotes: `'quoting a "quote"'`.

* Press the Escape key to get out of a bad situation: a prompt stuck at `+` (usually an unclosed quote).

* Escape special characters with a backslash (`\`): `c("\"", "\\")`

---
class: left, hide-count

### `stringr` package

<br>
<center><img src="data:image/png;base64,#https://stringr.tidyverse.org/logo.png" width="200px"/></center>

---
class: left, hide-count

### 7 main verbs (functions)

```r
x <- c("why", "video", "cross", "extra", "deal", "authority")
```

---
class: left, hide-count

### `str_detect(x, pattern)` tells you if there’s any match to the pattern

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_detect(x, "[aeiou]")
```

```
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
```

---
class: left, hide-count

### `str_count(x, pattern)` counts the number of matches

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_count(x, "[aeiou]")
```

```
## [1] 0 3 1 2 2 4
```

---
class: left, hide-count

### `str_subset(x, pattern)` extracts the matching components

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_subset(x, "[aeiou]")
```

```
## [1] "video"     "cross"     "extra"     "deal"      "authority"
```

---
class: left, hide-count

### `str_locate(x, pattern)` gives the position of the match

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_locate(x, "[aeiou]")
```

```
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     2   2
## [6,]     1   1
```

---
class: left, hide-count

### `str_extract(x, pattern)` extracts the text of the match

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_extract(x, "[aeiou]")
```

```
## [1] NA  "i" "o" "e" "e" "a"
```

---
class: left, hide-count

### `str_match(x, pattern)` extracts parts of the match defined by parentheses

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
# extract the characters on either side of the vowel
str_match(x, "(.)[aeiou](.)")
```

```
##      [,1]  [,2] [,3]
## [1,] NA    NA   NA
## [2,] "vid" "v"  "d"
## [3,] "ros" "r"  "s"
## [4,] NA    NA   NA
## [5,] "dea" "d"  "a"
## [6,] "aut" "a"  "t"
```

---
class: left, hide-count

### `str_replace(x, pattern, replacement)` replaces the matches with new text

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_replace(x, "[aeiou]", "?")
```

```
## [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
```

```r
str_replace_all(x, "[aeiou]", "?")
```

```
## [1] "why"       "v?d??"     "cr?ss"     "?xtr?"     "d??l"      "??th?r?ty"
```

---
class: left, hide-count

### `str_split(x, pattern)` splits up a string into multiple pieces

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_split(c("a,b", "c,d,e"), ",")
```

```
## [[1]]
## [1] "a" "b"
##
## [[2]]
## [1] "c" "d" "e"
```

---
class: newTopicSub, hide_logo

# Over to RStudio to exercise

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution-ShareAlike [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)}

* [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund

* {[`stringr`](https://stringr.tidyverse.org/)} package

* [Albert Kim's tutorial](https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html) adapted from STAT545 materials