class: titleSlide, hide_logo

# Data Wrangling
## Getting data into (and out of) R

<br>
<center><img src="data:image/png;base64,#logo.png" width="200px"/></center>

---
class: newTopicSub, hide_logo

# But first, file paths

---

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Today in "always check your teaching assumptions":<br><br>I told my class they could save an R Markdown file the same way they would save any other file.<br><br>One student was confused - they had only ever used Google Docs and thus had never saved a file before. 😱 <a href="https://t.co/9wGGG00zR9">pic.twitter.com/9wGGG00zR9</a></p>— Kelly Bodwin (@KellyBodwin) <a href="https://twitter.com/KellyBodwin/status/1484094473898381312?ref_src=twsrc%5Etfw">January 20, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---
class: left, hide-count

## Open RStudio

* If you are in a project, close it.
* In your Console run: `getwd()`
* What does it return?
* Next, open `wrangling1.Rproj` and run `getwd()` again in the Console.
* What does it return?

---
class: left, hide-count

## Important!

* Note that the path to your working directory is not the same as my working directory path. Therefore, if you `setwd("some_pathway/YOURNAME/wrangling1")`, the code will fail on my computer.
* **Absolute path (bad)**: `"some_pathway/YOURNAME/wrangling1/data/ducks.csv"`
* **Relative path with a project (good)**: `"wrangling1/data/ducks.csv"`
* R is picky with file names. 'Almost correct' does not cut it. You can't have extra spaces, typos, or letters in the wrong case. The same applies to file paths.

---
class: left, hide-count

## What the duck?

* Look for the template Rmd file in `materials/wrangling1`
* Run the `setup` chunk and then scroll to the `duck` chunk
* What happens when you run `knitr::include_graphics("duck.png")`?
* How can you fix it?

Windows users: See the note in [R4DS 8.3](https://r4ds.had.co.nz/workflow-projects.html#paths-and-directories) about slashes!

---
class: left, hide-count

## Move the duck and find it again

* Copy the image `duck.png`, rename it `mallard.png`, and move `mallard.png` up a level to the `materials` folder (a simple way to do this in one step is `More Gear > Copy To...`)
* The `mallard` chunk should fail
* `../` tells R to look up one level; `../../` says "go up two levels"
* Help R find `materials/mallard.png`

---
class: left, hide-count

## Just when you think you understand

* Knitting an R Markdown file is a special action: when a document is knit, the working directory for the knitting process is the directory where the Rmd file lives, not the project root
* Open the `analysis/duck.Rmd` file to see this

---
class: left, hide-count

## Tip

* Always use projects
* Always use `here::here()` (see the sketch below)
* Never `setwd()`
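A minimal sketch of what this looks like in practice, assuming your project has a `data/` folder containing the `ducks.csv` file mentioned on the earlier slide (swap in your own file):

```r
# here::here() builds the path from the project root, so this works
# no matter which subfolder your script or Rmd lives in
ducks_path <- here::here("data", "ducks.csv")  # assumed file from the earlier slide
ducks_path
#> something like "C:/Users/you/wrangling1/data/ducks.csv" on your machine

ducks <- readr::read_csv(ducks_path)
```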
---
class: newTopicSub, hide_logo

# If you can find it, you can import it into R.

Even shitty little Stata files.

---
class: bigFocus, hide-count

# Importing data, local

* Local files
  * csv: `read_csv()`
  * RData: `load()`
  * Stata: `haven::read_dta()`
  * SPSS: `haven::read_sav()`
  * SAS: `haven::read_sas()`

---
class: left, hide-count

## Local files

You're probably used to storing data on your computer and opening it with programs like Excel, SAS, SPSS, etc., so let's start there. All you need are two pieces of information:

1. The type of file, so we can use the correct function.
2. The file path.

---
class: left, hide-count

## csv

Comma-separated data are the workhorse of data science. You'll come to appreciate it when people share data in this format, so pay it forward when you start sharing your own files.

You currently have three main options for reading csv files (slowest to fastest):

1. Base R: `read.csv()`
2. **Tidyverse `{readr}`: `read_csv()`**
3. `{data.table}`: `fread()`

*If you use `read.csv()` (pls don't), you need to set `stringsAsFactors=FALSE` for [reasons](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/)

---
class: left, hide-count

## `read_csv()`

```r
df_csv <- read_csv(here::here("data", "confirmed_global.csv"))
spec(df_csv)
```

```
## cols(
##   `Province/State` = col_character(),
##   `Country/Region` = col_character(),
##   Lat = col_double(),
##   Long = col_double(),
##   `1/22/20` = col_double(),
##   `1/23/20` = col_double(),
##   `1/24/20` = col_double(),
##   `1/25/20` = col_double(),
##   `1/26/20` = col_double(),
##   `1/27/20` = col_double(),
##   `1/28/20` = col_double(),
##   `1/29/20` = col_double(),
##   `1/30/20` = col_double(),
##   `1/31/20` = col_double(),
##   `2/1/20` = col_double(),
##   `2/2/20` = col_double(),
##   `2/3/20` = col_double(),
##   `2/4/20` = col_double(),
##   `2/5/20` = col_double(),
##   `2/6/20` = col_double(),
##   `2/7/20` = col_double(),
##   `2/8/20` = col_double(),
##   `2/9/20` = col_double(),
##   `2/10/20` = col_double(),
##   `2/11/20` = col_double(),
##   `2/12/20` = col_double(),
##   `2/13/20` = col_double(),
##   `2/14/20` = col_double(),
##   `2/15/20` = col_double(),
##   `2/16/20` = col_double(),
##   `2/17/20` = col_double(),
##   `2/18/20` = col_double(),
##   `2/19/20` = col_double(),
##   `2/20/20` = col_double(),
##   `2/21/20` = col_double(),
##   `2/22/20` = col_double(),
##   `2/23/20` = col_double(),
##   `2/24/20` = col_double(),
##   `2/25/20` = col_double(),
##   `2/26/20` = col_double(),
##   `2/27/20` = col_double(),
##   `2/28/20` = col_double(),
##   `2/29/20` = col_double(),
##   `3/1/20` = col_double(),
##   `3/2/20` = col_double(),
##   `3/3/20` = col_double(),
##   `3/4/20` = col_double(),
##   `3/5/20` = col_double(),
##   `3/6/20` = col_double(),
##   `3/7/20` = col_double(),
##   `3/8/20` = col_double(),
##   `3/9/20` = col_double(),
##   `3/10/20` = col_double(),
##   `3/11/20` = col_double(),
##   `3/12/20` = col_double(),
##   `3/13/20` = col_double(),
##   `3/14/20` = col_double(),
##   `3/15/20` = col_double(),
##   `3/16/20` = col_double(),
##   `3/17/20` = col_double(),
##   `3/18/20` = col_double(),
##   `3/19/20` = col_double(),
##   `3/20/20` = col_double(),
##   `3/21/20` = col_double(),
##   `3/22/20` = col_double(),
##   `3/23/20` = col_double(),
##   `3/24/20` = col_double(),
##   `3/25/20` = col_double(),
##   `3/26/20` = col_double(),
##   `3/27/20` = col_double(),
##   `3/28/20` = col_double(),
##   `3/29/20` = col_double(),
##   `3/30/20` = col_double(),
##   `3/31/20` = col_double(),
##   `4/1/20` = col_double(),
##   `4/2/20` = col_double(),
##   `4/3/20` = col_double(),
##   `4/4/20` = col_double(),
##   `4/5/20` = col_double(),
##   `4/6/20` = col_double(),
##   `4/7/20` = col_double(),
##   `4/8/20` = col_double(),
##   `4/9/20` = col_double(),
##   `4/10/20` = col_double(),
##   `4/11/20` = col_double(),
##   `4/12/20` = col_double(),
##   `4/13/20` = col_double(),
##   `4/14/20` = col_double(),
##   `4/15/20` = col_double()
## )
```

---
class: left, hide-count

## Classes

A nice feature of `read_csv()` is that it tells us which class it guessed for each column in the imported object.

* Logical: TRUE or FALSE
* Integer: Whole numbers
* Double: All other numbers (e.g., 3.4)
* Character: Strings (e.g., "cat")
* Factors: Great for working with categorical variables
* Dates: `<date>`, `<time>`, and `<dttm>` (date-time)
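To make this concrete, here is a small, hypothetical tibble (not part of the course data) with one column of each class; `glimpse()` prints the abbreviated class next to each column:

```r
library(tidyverse)

# made-up duck records, just to illustrate the classes
toy <- tibble(
  banded  = c(TRUE, FALSE),                         # logical
  count   = c(3L, 7L),                              # integer
  density = c(1.12, 20.5),                          # double
  species = c("BUFF", "REDH"),                      # character
  habitat = factor(c("lake", "pond")),              # factor
  seen    = as.Date(c("2020-01-22", "2020-02-01"))  # date
)

glimpse(toy)  # shows <lgl>, <int>, <dbl>, <chr>, <fct>, <date>
```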
---
class: left, hide-count

## Custom specifications

In 99% of cases `read_csv("path/file.csv")` is all you need, but `{readr}` has you covered for the 1%. See [this vignette](https://readr.tidyverse.org/articles/readr.html) for a detailed look at your customization options.

```r
# import `1/22/20` as an integer
df_csv <- read_csv(here::here("data", "confirmed_global.csv"),
                   col_types = list(`1/22/20` = col_integer()))  # import as integer
spec(df_csv)
```

```
## cols(
##   `Province/State` = col_character(),
##   `Country/Region` = col_character(),
##   Lat = col_double(),
##   Long = col_double(),
##   `1/22/20` = col_integer(),
##   `1/23/20` = col_double(),
##   `1/24/20` = col_double(),
##   `1/25/20` = col_double(),
##   `1/26/20` = col_double(),
##   `1/27/20` = col_double(),
##   `1/28/20` = col_double(),
##   `1/29/20` = col_double(),
##   `1/30/20` = col_double(),
##   `1/31/20` = col_double(),
##   `2/1/20` = col_double(),
##   `2/2/20` = col_double(),
##   `2/3/20` = col_double(),
##   `2/4/20` = col_double(),
##   `2/5/20` = col_double(),
##   `2/6/20` = col_double(),
##   `2/7/20` = col_double(),
##   `2/8/20` = col_double(),
##   `2/9/20` = col_double(),
##   `2/10/20` = col_double(),
##   `2/11/20` = col_double(),
##   `2/12/20` = col_double(),
##   `2/13/20` = col_double(),
##   `2/14/20` = col_double(),
##   `2/15/20` = col_double(),
##   `2/16/20` = col_double(),
##   `2/17/20` = col_double(),
##   `2/18/20` = col_double(),
##   `2/19/20` = col_double(),
##   `2/20/20` = col_double(),
##   `2/21/20` = col_double(),
##   `2/22/20` = col_double(),
##   `2/23/20` = col_double(),
##   `2/24/20` = col_double(),
##   `2/25/20` = col_double(),
##   `2/26/20` = col_double(),
##   `2/27/20` = col_double(),
##   `2/28/20` = col_double(),
##   `2/29/20` = col_double(),
##   `3/1/20` = col_double(),
##   `3/2/20` = col_double(),
##   `3/3/20` = col_double(),
##   `3/4/20` = col_double(),
##   `3/5/20` = col_double(),
##   `3/6/20` = col_double(),
##   `3/7/20` = col_double(),
##   `3/8/20` = col_double(),
##   `3/9/20` = col_double(),
##   `3/10/20` = col_double(),
##   `3/11/20` = col_double(),
##   `3/12/20` = col_double(),
##   `3/13/20` = col_double(),
##   `3/14/20` = col_double(),
##   `3/15/20` = col_double(),
##   `3/16/20` = col_double(),
##   `3/17/20` = col_double(),
##   `3/18/20` = col_double(),
##   `3/19/20` = col_double(),
##   `3/20/20` = col_double(),
##   `3/21/20` = col_double(),
##   `3/22/20` = col_double(),
##   `3/23/20` = col_double(),
##   `3/24/20` = col_double(),
##   `3/25/20` = col_double(),
##   `3/26/20` = col_double(),
##   `3/27/20` = col_double(),
##   `3/28/20` = col_double(),
##   `3/29/20` = col_double(),
##   `3/30/20` = col_double(),
##   `3/31/20` = col_double(),
##   `4/1/20` = col_double(),
##   `4/2/20` = col_double(),
##   `4/3/20` = col_double(),
##   `4/4/20` = col_double(),
##   `4/5/20` = col_double(),
##   `4/6/20` = col_double(),
##   `4/7/20` = col_double(),
##   `4/8/20` = col_double(),
##   `4/9/20` = col_double(),
##   `4/10/20` = col_double(),
##   `4/11/20` = col_double(),
##   `4/12/20` = col_double(),
##   `4/13/20` = col_double(),
##   `4/14/20` = col_double(),
##   `4/15/20` = col_double()
## )
```

---
class: left, hide-count

## Always inspect your data after import

```r
lots <- read_csv(here::here("data", "lots.csv"))

lots %>%
  summarise(mean_dens = mean(Density_sqkm))
```

```
## # A tibble: 1 × 1
##   mean_dens
##       <dbl>
## 1        NA
```

---
class: left, hide-count

## Always inspect your data after import

```r
lots
```

```
## # A tibble: 7 × 4
##    Year Species Density_sqmi Density_sqkm
##   <dbl> <chr>          <dbl> <chr>
## 1  2002 BUFF         2.91    1.122660719
## 2  2002 HOME         0.00571 0.00220562
## 3  2002 MERG         0.0171  0.006616861
## 4  2002 REDH        53.1     20.51226854
## 5  2002 RUDU         0.0685  0.026467443
## 6  2002 SCAU        55.5     21.43862906
## 7  2002 SCOT         0.474   there are a lot of ducks
```

---
class: left, hide-count

## Ignoring NAs doesn't work because the class is character

```r
# ?mean
lots %>%
  summarise(mean_dens = mean(Density_sqkm, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   mean_dens
##       <dbl>
## 1        NA
```

---
class: left, hide-count

## "There are a lot of ducks" needs to be missing

```r
lots %>%
  mutate(Density_sqkm = case_when(
           Density_sqkm == "there are a lot of ducks" ~ NA_character_,
           TRUE ~ Density_sqkm),
         Density_sqkm = as.numeric(Density_sqkm)) %>%
  summarise(mean_dens = mean(Density_sqkm, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   mean_dens
##       <dbl>
## 1      7.18
```
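A second option, sketched here on the assumption that this string is the only non-numeric value in the column, is to declare it as a missing-value string at import time with `read_csv()`'s `na` argument, so `Density_sqkm` is parsed as a double from the start:

```r
# tell read_csv() which strings mean "missing"; Density_sqkm should then
# come in as a double and the earlier summarise() just works
lots2 <- read_csv(here::here("data", "lots.csv"),
                  na = c("", "NA", "there are a lot of ducks"))

lots2 %>%
  summarise(mean_dens = mean(Density_sqkm, na.rm = TRUE))
```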
---
class: left, hide-count

## RData

`.RData` files are specific to R. A really slick feature is that you can save multiple objects to one `.RData` file and load them all at once with `load()`. This can be handy when you need to save an interim step of your analysis without losing any information by converting to a different file format.

```r
save(lots, file = here::here("data", "example.RData"))
load(here::here("data", "confirmed_global.RData"))
```

---
class: left, hide-count

## Excel

There are several options for reading in Excel files, but I think the smoothest is the `read_excel()` function in the `{readxl}` package (part of the tidyverse).

```r
df_xls <- readxl::read_excel(here::here("data", "confirmed_global.xlsx"),
                             sheet = "myData")
```

---
class: left, hide-count

## Stata, SAS, SPSS

The `{haven}` package, a lesser-known star in the `tidyverse`, has several functions for reading and writing data from/to other stats programs.

```r
library(haven)
df_dta  <- read_dta(here::here("data", "confirmed_global.dta"))
df_spss <- read_sav(here::here("data", "confirmed_global.sav"))
df_sas  <- read_sas(here::here("data", "confirmed_global.sas7bdat"))
```

---
class: bigFocus, hide-count

# Importing data, web

* csv: `read_csv(INSERT URL HERE)`
* Google Sheets: `googlesheets4::read_sheet()`
* copy/paste with `{datapasta}`
* APIs with `{httr}` and `{jsonlite}`
* web scraping with `{rvest}`
* Databases with `{DBI}` and a database backend

---
class: left, hide-count

## csv

If you're lucky, you might find the data you want online posted as a csv file. Simply use `read_csv()` to read it into R by passing a URL instead of a local file name.

```r
myUrl <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
df_url <- read_csv(myUrl)
```

---
class: left, hide-count

## Google Sheets

The `{googlesheets4}` package, also a member of the `tidyverse`, lets you read and write from/to Google Sheets. Running `gs4_deauth()` after loading the package turns off authentication so you can read publicly shared sheets without having to log in.

```r
library(googlesheets4)
gs4_deauth()
gsURL <- "https://docs.google.com/spreadsheets/d/1ibAj_plJBjumAvr8P8_TzwWaU8gTciOeuTwVjN4cZIM/edit?usp=sharing"
df_gs <- read_sheet(gsURL)
```

---
class: left, hide-count

## APIs

Sometimes data are available via APIs (application programming interfaces). You read the API "docs" to understand how to make and parse requests via R.
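As a rough sketch of the `{httr}` + `{jsonlite}` pattern from the earlier list (the endpoint and query parameters below are hypothetical; take the real ones from the API's docs, and see the post embedded below for a worked example):

```r
library(httr)
library(jsonlite)

# hypothetical endpoint -- replace with the URL and parameters in the API docs
resp <- GET("https://api.example.com/v1/cases", query = list(region = "NC"))
stop_for_status(resp)  # fail loudly if the request didn't succeed

# parse the JSON body into R objects; assumes the payload is a flat table
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
df_api <- as_tibble(parsed)  # as_tibble() comes with the tidyverse loaded in setup
```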
<iframe src="https://fromthebottomoftheheap.net/2021/01/31/getting-data-from-canada-covid-19-tracker-using-r/" width="100%" height="90%" data-external="1"></iframe>

---
class: left, hide-count

## Copy/paste with `{datapasta}`

<img src="data:image/png;base64,#img/tribble_paste.gif" width="100%" />

---
class: left, hide-count

## Scraping

Sometimes you need datapasta, but on a grand, automated scale. One option is web scraping, and it can be controversial. We'll return to this in the class on iteration.

---
class: left, hide-count

## Databases

Large projects and applications often store data in a database hosted in the cloud. You can access these databases from R with the `{DBI}` package (and a DB-specific package like `{RPostgreSQL}`) and query the data with `{dbplyr}`, a database backend that functions like `{dplyr}`.

<iframe src="https://db.rstudio.com/" width="100%" height="90%" data-external="1"></iframe>

---
class: newTopicSub, hide_logo

# Exporting data

---
class: left, hide-count

## Writing data to your computer

* `readr::write_csv(object, file = "path")`
* `writexl::write_xlsx(object, file = "path")` (see `?writexl::write_xlsx` for saving multiple objects to different sheets)
* `save(object, file = "path")`

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides, with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)}
* Data types example inspired by [Data Science in a Box](https://datasciencebox.org/)