Objectives

By the end of this session, you should be able to:

  • State the benefits of a reproducible analysis
  • Explain the layout of the RStudio interface
  • Write Rmarkdown syntax and knit an Rmarkdown file
  • Read an R help file
  • Import data from CSV, RDS, Stata, SAS, SPSS, URLs, and APIs
  • Inspect R objects
  • Conduct initial data checks
  • Reproduce this plot…

Today’s Goal

Install R and RStudio

  • Copy the workshop folder from the USB drive to your desktop
  • Open the installers folder and find the correct folder for your operating system
  • Install R
  • Install RStudio

Reproducibility

I will show you how to make your analyses reproducible. We will write code to import raw data, do some data manipulation, analyze the data and create a plot, and generate a report. This is the beginning of the end of editing raw data files and pointing and clicking your way to an analysis that is hard to reproduce.

Literate Programming

I’ll also show you how to combine your analysis code with your write-up. This means the end is near for copying/pasting results, tables, and figures from your stats program/Excel to your report.

Data Analysis Pipeline

Tools

  • RStudio: provides a nicer interface for using R and an authoring framework for data science
  • LaTeX: helps behind the scenes to knit to PDF and offers more control over typesetting
  • Zotero, BibDesk, and .bib files: manage references

RStudio Interface

Optional: Change your layout

Tools > Global Options

Projects

You can create a new RStudio project for every real life project you have. Just tell RStudio to find the project folder on your computer. A major benefit of projects is that all of the file paths will be relative to the root directory of your project.

I created a project for you today. Find the workshop folder on your desktop and double click on the RStudio project file. This will open our project in RStudio.

Working directory

In your console run:

Since you opened our project file, R should think that your working directory is your project directory.
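
A minimal sketch, assuming the command meant here is base R's `getwd()`, which reports the current working directory:

```r
# Ask R which folder it currently treats as the working directory
wd <- getwd()
wd

# With the workshop project open, this path should end in the project folder
```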

Relative file paths

Every file reference should be relative to this working (root) directory.

  • ../ goes up one level
  • ../../ goes up two levels
  • subdirectory/ goes down one level
  • subdirectory/subsubdirectory/ goes down two levels

Typical Organization

Better Organization

New files

R script

Create a new R script file

  • similar to Stata’s do file
  • not interactive
  • output appears in console and plot window

RMarkdown

Create a new RMarkdown HTML file

Open the Attendance Report

I created a report Rmd file for you. Look for the FILES tab and open the reports folder.

A tricky thing

When you knit a document, RStudio thinks that the directory where the document is saved should be the working directory. If your template is stored in workshop/reports, for instance, reports will become the working directory and relative file paths won’t work. To fix this, we added the following to the R code chunk named setup:

This tells RStudio that the working directory is 1-level up from where the report file sits.
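
A sketch of what the body of such a setup chunk might look like, assuming the knitr option `root.dir` is the mechanism used (the workshop's actual chunk is not reproduced here):

```r
# Goes inside the R code chunk named "setup" in the report's Rmd file.
# Assumption: the report sits one folder below the project root.
# All later chunks are then evaluated from the project root.
knitr::opts_knit$set(root.dir = normalizePath(".."))
```

The option only affects chunks that come after the setup chunk, so it belongs at the top of the document.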

Markdown language

Plain text, italics, bold, monospaced font

strikethrough, sub/superscript: ₂ and ², endash: –, emdash: —

equation: \(A = \pi*r^{2}\)

\[E = mc^{2}\]

block quote

list:

  • item 1
  • item 2
  • item 3

Write some Markdown text

Turn to the back of your Markdown cheatsheet and try writing some text under the “Markdown Practice” heading (e.g., bold, italics, lists, subheadings, web links, footnote). Then click “Knit” to compile the document.

Getting help

An unexpected result!

## [1] NA

?function

?mean
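
A small sketch of how an unexpected `NA` like the one above can arise and how the help file points to a fix. The vector `x` is made up for illustration:

```r
# A vector with one missing value
x <- c(2, 4, NA)

# By default, mean() returns NA when any value is missing
mean(x)

# ?mean documents the na.rm argument; TRUE drops NAs before averaging
mean(x, na.rm = TRUE)
```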

Stack Overflow

Bad questions

Reproducible Example

Low frustration tolerance?

Remember this

R does exactly what you tell it to do, rather than what you want it to do
- Kieran Healy

  • No one writes error-free code the first time
  • De-bugging and testing are part of the process
  • Each error you make teaches you something about how R works

Common mistakes

  • Unbalanced (parentheses
  • Completing expressions (+ vs. > in the command line)
  • Code that wraps across lines (+)
  • Spelling!

Let’s import some data

  • csv
  • Excel
  • Google Sheets
  • Stata
  • APIs

CSV

A csv file is an ideal format for sharing data. Simple. Lightweight. Readable by any program. Import with the read.csv() function. Start by running ?read.csv in the console to view the help file.

? Help File

?read.csv: What arguments are required?

Today’s Example

The data underlying this plot are stored in 4 csv files, one for each community.

Import the Data

Let’s import the first csv file. Type the following in the import code chunk.

This will create an object called c2AG. You can call it whatever you want, but for today use c2AG for “Community 2 Attendance Group”. Find this object in your ENVIRONMENT and take a look.
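
A self-contained sketch of the import step. The workshop's file names are not shown here, so this example writes a tiny stand-in CSV to a temporary file first; the column names and values are made up:

```r
# Stand-in for a workshop file: write a tiny CSV to a temporary location
# (hypothetical columns and values, not the real attendance data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,S1A,S2A", "1,1,0", "2,1,1"), csv_path)

# `file` is the only read.csv() argument without a default value
c2AG <- read.csv(file = csv_path)

# Check what was imported
str(c2AG)
```

In the workshop project you would pass a path relative to the project root instead of `csv_path`.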

Import the Data

Now import the three other csv files and name them c3AG, c4AG, and c5AG.

RDS

Any R object can be saved as an RDS file, so you might need to import from RDS on occasion:
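
A minimal round-trip sketch using base R's `saveRDS()` and `readRDS()`. The data frame `scores` and the temporary file are made up for illustration:

```r
# A small made-up object to save
scores <- data.frame(id = 1:3, score = c(10, 12, 9))

# Save it to a temporary RDS file, then read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)
restored <- readRDS(rds_path)

# The restored object matches the original
identical(scores, restored)
```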

Import from other stats programs

Maybe a collaborator works in Stata, SAS, or SPSS and wants to make the jump to R. No problem, just load the haven package:
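
A sketch of reading a Stata file with haven, assuming the package is installed. Because no collaborator file is available here, the example first writes a made-up `.dta` file to a temporary location:

```r
library(haven)

# Create a tiny made-up Stata file so the example is self-contained
dta_path <- tempfile(fileext = ".dta")
write_dta(data.frame(id = 1:3, age = c(31, 45, 28)), dta_path)

# Read the Stata file into R
stata_data <- read_dta(dta_path)
stata_data

# read_sas() and read_sav() play the same role for SAS and SPSS files
```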

Import from other stats programs

To export to these programs, just use the write_ functions in haven.

Import from Excel

Or maybe you have descended into the 6th circle of hell (heresy) and you are given data stored in Excel files. You can manually convert to CSV files and import, or you can use the readxl package, which is installed with the tidyverse but loaded separately with library(readxl).

Import flat files from the Web

You can use the RCurl package to grab data from the web. This link goes to a csv published on Google Drive.

(Check out the googlesheets package if you want to read and write from YOUR Google Sheets account.)

Import data from APIs

##  [1] "https://t.co/4OjDqTMEIx"                                                                             
##  [2] "Imagine how much wasteful spending we’d save if we didn’t have Chuck and Nancy standing in our way! "
##  [3] "The HISTORIC Rescissions Package we’ve proposed would cut $15,000,000,000 in Wasteful Spending! We a"
##  [4] "Terrific new book out by the wonderful Harris Faulkner, “9 Rules of Engagement.” Harris shares lesso"
##  [5] "Senator @RogerWicker of Mississippi has done everything necessary to Make America Great Again! Get o"
##  [6] "Vote for Congressman Devin Nunes, a true American Patriot the likes of which we rarely see in our mo"
##  [7] "Get the vote out in California today for Rep. Kevin McCarthy and all of the great GOP candidates for"
##  [8] "In High Tax, High Crime California, be sure to get out and vote for Republican John Cox for Governor"
##  [9] "Separating families at the Border is the fault of bad legislation passed by the Democrats. Border Se"
## [10] "....@NASCAR and Champion @MartinTruex_Jr were recently at the White House. It was a great day for a "

Connect to Databases

RStudio makes it easy to connect to a wide range of databases, query/analyze the data inside the database, and only import what you need into R.

See RStudio for more details.

Inspect the objects

Data frame

R has several data types. c2AG is a data frame that consists of 136 rows and 22 variables. Let’s use two built-in functions to count these and print the results inline. Go to where you see the following line:

The datCSV data frame has … observations (rows) and … columns.

Replace the first ... with `r nrow(c2AG)` for the number of rows, and replace the second ... with `r length(c2AG)` for the number of columns. Then knit your document.
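
To see why `length()` gives the number of columns, here is a small sketch on a made-up data frame called `toy` (not the workshop data):

```r
# Made-up data frame with 4 rows and 3 columns
toy <- data.frame(id = 1:4, S1A = c(1, 0, 1, 1), S2A = c(1, 1, 0, 1))

nrow(toy)    # number of rows
length(toy)  # number of columns, because a data frame is a list of columns
ncol(toy)    # same count, and arguably clearer to read
```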

Glimpse

We can also examine c2AG with the glimpse() function in the dplyr package, which is included in the tidyverse. Create a new R code chunk and type.
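
A sketch of what that chunk might contain, assuming dplyr is installed. A made-up data frame `toy` stands in for c2AG so the example runs on its own:

```r
library(dplyr)

# Made-up stand-in for c2AG
toy <- data.frame(id = 1:4, S1A = c(1, 0, 1, 1), S2A = c(1, 1, 0, 1))

# One line per column: name, type, and the first few values
glimpse(toy)
```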

Hide output

Knit the file and you’ll see that the glimpse results print. Replace {r glimpse} with {r glimpse, results='hide'} and knit again.

Stop code from printing

To turn off code printing, replace {r glimpse, results='hide'} with {r glimpse, results='hide', echo=FALSE} and knit again.

Head/Tail

You can also use the functions head() or tail() to examine the first or last few rows.
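
For instance, on a made-up ten-row data frame:

```r
# Made-up data frame with 10 rows
toy <- data.frame(id = 1:10, attend = rep(c(1, 0), 5))

first_rows <- head(toy)         # first 6 rows by default
last_rows  <- tail(toy, n = 3)  # last 3 rows

first_rows
last_rows
```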

Quick exploration

One option for a quick summary of the dataframe is:

But this can be unwieldy when your dataframe is big.

You can also make simple cross-tabs

names(c2AG) is a quick way to get the names of all variables (columns) in a dataframe.
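
A sketch of these three quick checks, assuming the functions meant are base R's `summary()`, `table()`, and `names()`. The data frame `toy` is made up:

```r
# Made-up attendance data for two sessions
toy <- data.frame(S1A = c(1, 1, 0, 1), S2A = c(1, 0, 0, 1))

# Summary statistics for every column
summary(toy)

# Cross-tab of session 1 attendance (rows) by session 2 attendance (columns)
xtab <- table(toy$S1A, toy$S2A)
xtab

# Column names
names(toy)
```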

Introduction to $

  • c2AG tells R to do something with the c2AG dataframe
  • c2AG followed by the $ lets you access specific columns; RStudio will prompt you with column names as soon as you type $ after the name of a dataframe
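
A short sketch of `$` on a made-up data frame:

```r
# Made-up data frame
toy <- data.frame(id = 1:3, S1A = c(1, 0, 1))

# Pull out one column as a vector
toy$S1A

# Use that column in a calculation, e.g. total attendance at session 1
sum(toy$S1A)
```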

Introduction to NA

  • NA means missing data in R
  • A common trap in analysis is not understanding how the function you are using handles NAs by default
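
The sketch below shows two different defaults on a made-up vector: `sum()` propagates the NA, while `table()` silently drops it.

```r
# Made-up attendance vector with one missing value
attend <- c(1, 0, NA, 1)

# Count the missing values
sum(is.na(attend))

# sum() returns NA unless told to remove missing values
sum(attend)
sum(attend, na.rm = TRUE)

# table() leaves NA out by default; useNA = "ifany" shows it
table(attend)
table(attend, useNA = "ifany")
```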

5 key wrangling functions

function      purpose
filter()      Pick observations by their values
arrange()     Reorder the rows
select()      Pick variables by their names
mutate()      Create new variables
summarise()   Create summaries

See Wickham and Grolemund (2017) for more details and examples

Filter

filter() lets you subset based on row values. Just give it a data frame and one or more conditions.

Combine conditions with logical operators: and (&), or (|), not (!). Test equality with ==. How many observations do we have of caregivers who attended the first and second sessions?
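
A sketch of this kind of filter, assuming dplyr is installed. The data frame `toy` is made up; S1A and S2A stand for attendance at sessions 1 and 2:

```r
library(dplyr)

# Made-up attendance data (1 = attended, 0 = did not attend)
toy <- data.frame(id = 1:4, S1A = c(1, 1, 0, 1), S2A = c(1, 0, 0, 1))

# Keep rows where both conditions are true
both <- filter(toy, S1A == 1 & S2A == 1)

# Number of matching observations
nrow(both)
```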

Arrange

arrange() sorts your dataframe by one or more columns.
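
For example, on a made-up data frame with a hypothetical `age` column:

```r
library(dplyr)

# Made-up data
toy <- data.frame(id = 1:4, age = c(34, 29, 41, 29))

# Sort by age, youngest first
arrange(toy, age)

# Sort by age, oldest first, breaking ties by id
sorted <- arrange(toy, desc(age), id)
sorted
```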

Select

select() lets you grab just a subset of columns from your data frame.
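
A short sketch on a made-up data frame:

```r
library(dplyr)

# Made-up data
toy <- data.frame(id = 1:3, S1A = c(1, 0, 1), S2A = c(1, 1, 0))

# Keep only the named columns
kept <- select(toy, id, S1A)
names(kept)

# A minus sign drops a column instead
dropped <- select(toy, -S2A)
names(dropped)
```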

Mutate

mutate() creates new columns. In this example, we first create a smaller data frame called temp to make it easier to see the new variable ageDifference that we create in the second step.
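
The two steps might look like the sketch below. The columns `caregiverAge` and `childAge` are hypothetical stand-ins, since the real column names are not shown here:

```r
library(dplyr)

# Made-up data with hypothetical age columns
toy <- data.frame(
  id = 1:3,
  caregiverAge = c(34, 29, 41),
  childAge = c(8, 5, 12),
  S1A = c(1, 0, 1)
)

# Step 1: a smaller data frame with just the columns of interest
temp <- select(toy, id, caregiverAge, childAge)

# Step 2: add the new column
temp <- mutate(temp, ageDifference = caregiverAge - childAge)
temp
```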

Row Bind

Before we learn how to summarise(), let’s combine our dataframes from each community into one large dataframe. Each table has the same columns, so we want to append or add the rows. We’ll use the bind_rows() function. But first, let’s add a variable to each table that indicates the community number.

Row Bind

Now we can bind the tables together.
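
A sketch of both steps on two made-up tables standing in for the community data frames. The column name `community` is an assumption:

```r
library(dplyr)

# Made-up stand-ins for two community tables with the same columns
c2 <- data.frame(id = 1:2, S1A = c(1, 0))
c3 <- data.frame(id = 1:3, S1A = c(1, 1, 0))

# Add a variable recording which community each row came from
c2$community <- 2
c3$community <- 3

# Append the rows of c3 below the rows of c2
all <- bind_rows(c2, c3)

# Rows and columns of the combined table
dim(all)
```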

What is the dimension of the all table (data frame)?

Summarise

summarise() collapses the data to summary values. It can be used with group_by() to summari(z)e by group.
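
A sketch of an overall and a grouped summary on made-up data. The summary column name `meanS1A` is hypothetical:

```r
library(dplyr)

# Made-up combined data for two communities
combined <- data.frame(
  community = c(2, 2, 3, 3, 3),
  S1A = c(1, 0, 1, 1, 0)
)

# One row: the proportion attending session 1 overall
overall <- summarise(combined, meanS1A = mean(S1A))
overall

# One row per community
by_community <- summarise(group_by(combined, community), meanS1A = mean(S1A))
by_community
```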

The Pipe

%>%

We can “pipe” together multiple steps without having to make intermediate objects like temp.
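
A sketch of the same kind of workflow written as one pipeline, on made-up data with hypothetical age columns. Each step passes its result to the next, so no `temp` object is needed:

```r
library(dplyr)

# Made-up data with hypothetical age columns
toy <- data.frame(
  id = 1:3,
  caregiverAge = c(34, 29, 41),
  childAge = c(8, 5, 12),
  S1A = c(1, 0, 1)
)

# Filter, select, and mutate in one chain
piped <- toy %>%
  filter(S1A == 1) %>%
  select(id, caregiverAge, childAge) %>%
  mutate(ageDifference = caregiverAge - childAge)

piped
```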

The Tidy 4

function      purpose
gather()      Gather variable values spread across multiple columns
spread()      Spread out observation values scattered across rows
separate()    Split one column into two or more columns
unite()       Collapse multiple columns into one column

(http://r4ds.had.co.nz/tidy-data.html)

Gather

Next we gather the ‘wide’ session attendance data. We store the column names (S1A, S2A, etc)—the key—in a variable we call session. We also store the attendance values—1’s and 0’s—in a new variable called attend.
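
A sketch of this reshaping step with tidyr's `gather()`, on a made-up wide table with only two session columns:

```r
library(tidyr)

# Made-up wide data: one column per session
wide <- data.frame(id = 1:2, S1A = c(1, 0), S2A = c(1, 1))

# Column names go into `session`, their values into `attend`
long <- gather(wide, key = "session", value = "attend", S1A, S2A)
long
```

Newer versions of tidyr offer `pivot_longer()` for the same job; `gather()` still works.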

Now Summarise by Community

And Summarise Overall

And Plot

Grammar of graphics

  • ggplot2 is a tidyverse package by Hadley Wickham that implements Wilkinson’s Grammar of Graphics, a helpful approach for thinking about the components of an effective visualization of data.
  • In this session we will focus on Wickham’s implementation of this “gg” idea in his package ggplot2.
  • For more background on visualization principles and what makes a good plot, see Healy (2017) for a nice overview. See also work by William Cleveland, such as The Elements of Graphing Data.

The ggplot way (Healy 2017)

Start by defining the data

This line tells ggplot() which dataset to use and produces a blank plot.

Layering

For convenience, we’re going to assign each step to an object called p. You can call it whatever you want. The key idea is that we create a base plot p and add to it in each step. So here, p is just an empty plot. If you want to see the result, you have to print p.
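
A sketch of this first step, assuming the ggplot2 and gapminder packages are installed:

```r
library(ggplot2)
library(gapminder)

# Base plot: only the dataset is declared, so nothing is drawn yet
p <- ggplot(data = gapminder)

# Printing p shows an empty panel
p
```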

Inspect the data

## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Declare data and mapping

The first two ggplot() arguments are data and mapping. We’ll drop the data= and mapping= labels from here on.

The aes() function

  • The mapping argument calls for aesthetic mappings of variables to plot elements.
  • Essentially, with aes() you tell ggplot() which variable from the dataset should map to the x-axis, and which should map to the y-axis.
  • Here, we are mapping two variables from the dataset gapminder: gdpPercap goes to the x-axis, while lifeExp goes to the y-axis.
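
The mapping described above can be sketched as follows, assuming the ggplot2 and gapminder packages are installed:

```r
library(ggplot2)
library(gapminder)

# Fully labelled version
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Same plot with the data= and mapping= labels dropped
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
p
```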

What do we have so far?

Not much. We’ve just told ggplot to use the gapminder dataset and to map two variables, but we have not specified the type of plot we want.

Specify a geom()

Use the + sign to add the next layer to this plot—a geom()! In this example, we add geom_point(), the points geom.
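
A sketch of adding the points layer, assuming the ggplot2 and gapminder packages are installed:

```r
library(ggplot2)
library(gapminder)

# Base plot with data and mapping
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Add a layer of points, one per row of the data
p_points <- p + geom_point()
p_points
```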

Fine tune geom_point()

Check out the help file for your geom to learn more about use or review the great reference material on tidyverse.org: http://ggplot2.tidyverse.org/reference/geom_point.html

Pick a different geom

geom_smooth() calculates a smoothed line and shades a confidence band around it. Check out the arguments to geom_smooth() to tinker with the smoothing function used.
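
A sketch of swapping in the smoothing geom, assuming the ggplot2 and gapminder packages are installed. The choice of `method = "lm"` is only one example of an argument to tinker with:

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Default smoother, chosen by ggplot2 based on the size of the data
p + geom_smooth()

# A straight-line fit instead, with the shaded band switched off
p_lm <- p + geom_smooth(method = "lm", se = FALSE)
p_lm
```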

Add both geoms

Rescale the x-axis

Add some scale labels

Change the look

Add some labels

Change the theme

Map aesthetics to variables

For instance, maybe instead of making all the points “purple”, we want to color the points by values in the variable continent.
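
The difference between setting a color and mapping a color can be sketched like this, assuming the ggplot2 and gapminder packages are installed:

```r
library(ggplot2)
library(gapminder)

# Setting: every point gets the same fixed color (outside aes())
p_fixed <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "purple")

# Mapping: color depends on a variable (inside aes()), with a legend added
p_mapped <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

p_fixed
p_mapped
```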

Adding the geoms

Can also map shape to point values

Map fill to se

Adding the geoms

Map aesthetics per geom

Small multiples

The group trends are hard to see. Let’s try faceting by continent to make a series of “small multiples”. First we need to get back to our basic plot defining point and line color:

facet_wrap()
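
A sketch of the faceting step, assuming the ggplot2 and gapminder packages are installed. The straight-line smoother is an illustrative choice, not necessarily the one used in the workshop:

```r
library(ggplot2)
library(gapminder)

# Basic plot: points and lines both colored by continent
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  geom_smooth(method = "lm")

# One panel per continent
p_facets <- p + facet_wrap(~ continent)
p_facets
```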

Make it nice

Plotting Attendance

References

Healy, Kieran. 2017. Data Visualization for Social Science. http://socviz.co/.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.