# Data Wrangling

## Strings

---
class: left, hide-count

### Me, every time I work with strings. Every time.

---

# Why do we care?

---

### Cleaning data

```r
library(tidyverse)  # loads tibble, dplyr, and stringr

df <- tibble(country = c("Kenya", 
                         "Kennya", 
                         "kenya"))
df
```

```
## # A tibble: 3 × 1
##   country
##   <chr>  
## 1 Kenya  
## 2 Kennya 
## 3 kenya
```
.panel[.panel-name[Clean]

```r
df %>%
  mutate(country = str_to_lower(country, 
                                locale = "en"),
         country = case_when(
           country == "kennya" ~ "kenya",
           TRUE ~ country
         ))
```

```
## # A tibble: 3 × 1
##   country
##   <chr>  
## 1 kenya  
## 2 kenya  
## 3 kenya
```
]

---

### Transforming data

```r
df <- tibble(p1cry = c("y", "n", "y"),
             p2eat = c("y", "y", "y"),
             p3sleep = c("y", "y", "n"))
df
```

```
## # A tibble: 3 × 3
##   p1cry p2eat p3sleep
##   <chr> <chr> <chr>  
## 1 y     y     y      
## 2 n     y     y      
## 3 y     y     n
```
.panel[.panel-name[Transformed names 1]

```r
df %>%
  setNames(str_replace(names(.), ".+?(?=\\d)", "phq"))
```

```
## # A tibble: 3 × 3
##   phq1cry phq2eat phq3sleep
##   <chr>   <chr>   <chr>    
## 1 y       y       y        
## 2 n       y       y        
## 3 y       y       n
```
]

```r
df %>%
  setNames(str_replace(names(.), ".+?(?=\\d)", "phq")) %>%
  setNames(str_replace(names(.), "(.*\\d)(.*)", "\\1_\\2"))
```

```
## # A tibble: 3 × 3
##   phq1_cry phq2_eat phq3_sleep
##   <chr>    <chr>    <chr>     
## 1 y        y        y         
## 2 n        y        y         
## 3 y        y        n
```
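
A possible alternative, assuming dplyr 1.0.0 or later: `rename_with()` applies a function to the column names, so both renaming steps can collapse into a single `str_replace()` with a backreference. The pattern `^p(\\d)` is an assumption about these particular names (a leading `p` followed by a digit), not a general recipe.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tibble(p1cry = c("y", "n", "y"),
             p2eat = c("y", "y", "y"),
             p3sleep = c("y", "y", "n"))

# capture the digit after the leading "p" and rebuild the name around it
df_renamed <- df %>%
  rename_with(~ str_replace(.x, "^p(\\d)", "phq\\1_"))

names(df_renamed)
```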

---
class: left, hide_logo

### Text mining

---

### Get help from sites like regex101

---
class: left, hide-count

### `RegExplain` RStudio Addin

```r
devtools::install_github("gadenbuie/regexplain")
```

---
class: left, hide-count

### Things to know about strings

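* You can create strings with either single quotes or double quotes. Use `"` by default; switch to `'` when the string itself contains double quotes, as in `'quoting a "quote"'`.
* Escape (the key) gets you out of a bad situation: if the console prompt is stuck at `+`, R is waiting for you to finish the input (often an unclosed quote), so press Escape to cancel.
* Escape (with a backslash, `\`) special characters: `c("\"", "\\")`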
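
To see the difference between what you type and what the string actually contains, it can help to compare the printed form with `writeLines()`. A short sketch (the variable names are only for illustration):

```r
library(stringr)

x <- c("\"", "\\")

# print() shows the escaped representation, backslashes included
print(x)

# writeLines() shows the raw contents of each string
writeLines(x)

# each string holds exactly one character
str_length(x)

# single and double quotes build the same string
same <- identical('quoting a "quote"', "quoting a \"quote\"")
same
```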

---
class: left, hide-count

### `stringr` package

---
class: left, hide-count

### 8 main verbs (functions)

```r
x <- c("why", "video", "cross", "extra", "deal", "authority")
```

---
class: left, hide-count

### `str_detect(x, pattern)`

tells you if there’s any match to the pattern

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_detect(x, "[aeiou]")
```

```
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
```
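
Because `str_detect()` returns a logical vector, it combines naturally with `sum()`, `mean()`, and subsetting. A small sketch with the same `x`:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

has_vowel <- str_detect(x, "[aeiou]")

sum(has_vowel)    # how many words contain a vowel
mean(has_vowel)   # what proportion contain a vowel
x[!has_vowel]     # the words with no match
```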

---
class: left, hide-count

### `str_count(x, pattern)`

counts the number of matches

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_count(x, "[aeiou]")
```

```
## [1] 0 3 1 2 2 4
```
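
One detail worth knowing: matches never overlap, so the count can be lower than you might expect. A quick sketch with a made-up string:

```r
library(stringr)

# "aba" could start at positions 1, 3, and 5,
# but each match consumes its characters, so only two are counted
n_matches <- str_count("abababa", "aba")
n_matches
```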

---
class: left, hide-count

### `str_subset(x, pattern)`

extracts the matching components

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_subset(x, "[aeiou]")
```

```
## [1] "video"     "cross"     "extra"     "deal"      "authority"
```
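
`str_subset()` can be thought of as a shortcut for subsetting with `str_detect()`. The sketch below checks that equivalence and, assuming stringr 1.4.0 or later, uses `negate = TRUE` to keep the non-matching elements instead.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# same result as subsetting with the logical vector from str_detect()
same <- identical(str_subset(x, "[aeiou]"),
                  x[str_detect(x, "[aeiou]")])
same

# keep the elements that do NOT match
no_vowel <- str_subset(x, "[aeiou]", negate = TRUE)
no_vowel
```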

---

### `str_locate(x, pattern)`

gives the position of the match

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_locate(x, "[aeiou]")
```

```
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     2   2
## [6,]     1   1
```
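
The positions are mostly useful as input to something else. As a sketch, feeding them to `str_sub()` pulls out the matched text, which should agree with what `str_extract()` returns:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

loc <- str_locate(x, "[aeiou]")

# use the start and end columns to cut out the first vowel;
# elements with no match stay NA
first_vowel <- str_sub(x, loc[, "start"], loc[, "end"])
first_vowel
```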

---

### `str_extract(x, pattern)`

extracts the text of the match

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_extract(x, "[aeiou]")
```

```
## [1] NA  "i" "o" "e" "e" "a"
```
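
`str_extract()` only returns the first match per string. To get every match, `str_extract_all()` returns a list with one character vector per input element:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

all_vowels <- str_extract_all(x, "[aeiou]")

all_vowels[[1]]   # "why" has no vowels, so this is an empty vector
all_vowels[[2]]   # every vowel in "video", in order
```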
---

### `str_match(x, pattern)`

extracts parts of the match defined by parentheses

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
# extract the characters on either side of the vowel
str_match(x, "(.)[aeiou](.)")
```

```
##      [,1]  [,2] [,3]
## [1,] NA    NA   NA  
## [2,] "vid" "v"  "d" 
## [3,] "ros" "r"  "s" 
## [4,] NA    NA   NA  
## [5,] "dea" "d"  "a" 
## [6,] "aut" "a"  "t"
```
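
The result is a character matrix: column 1 holds the full match and each later column holds one parenthesised group. A short sketch of pulling the groups out by column (the names `before` and `after` are only for illustration):

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

m <- str_match(x, "(.)[aeiou](.)")

before <- m[, 2]   # character just before the vowel
after  <- m[, 3]   # character just after the vowel

before
after
```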

---

### `str_replace(x, pattern, replacement)`

replaces the matches with new text

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_replace(x, "[aeiou]", "?")
```

```
## [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
```

```r
str_replace_all(x, "[aeiou]", "?")
```

```
## [1] "why"       "v?d??"     "cr?ss"     "?xtr?"     "d??l"      "??th?r?ty"
```
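
The replacement can also reuse part of the match through a backreference such as `\\1`. A minimal sketch that wraps the first vowel in angle brackets instead of overwriting it:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# \\1 refers to whatever the first set of parentheses matched
wrapped <- str_replace(x, "([aeiou])", "<\\1>")
wrapped
```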

---

### `str_split(x, pattern)`

splits up a string into multiple pieces

```r
x
```

```
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
```

```r
str_split(c("a,b", "c,d,e"), ",")
```

```
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"
```
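
By default the result is a list, because each string can split into a different number of pieces. If a matrix is easier to work with, `simplify = TRUE` returns one, padding the shorter rows with empty strings:

```r
library(stringr)

pieces <- str_split(c("a,b", "c,d,e"), ",", simplify = TRUE)

dim(pieces)    # one row per input string, one column per piece
pieces[1, ]    # the first row is padded with "" to fill three columns
```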

---

# Over to RStudio for exercises

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution-ShareAlike 4.0 ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)} 
* [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund 
* {[`stringr`](https://stringr.tidyverse.org/)} package
* [Albert Kim's tutorial](https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html) adapted from STAT545 materials