class: titleSlide, hide_logo

# Data Wrangling
## Factors

<br>
<center><img src="data:image/png;base64,#logo.png" width="200px"/></center>

---
class: left, hide_logo

## Factors

R uses factors to handle categorical variables: variables that have a fixed and known set of possible values.

```r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```

---
class: left, hide_logo

### What are factors

We can think of a factor as a character vector (level labels) and an integer vector (level numbers) glued together.

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
as.integer(x)
```

```
## [1] 1 2 3 2
```

---
class: left, hide_logo

### Can be tricky

```r
x <- factor(c("never 0", "2", "2", "4"))
levels(x)
```

```
## [1] "2"       "4"       "never 0"
```

--

```r
as.integer(x)
```

```
## [1] 3 1 1 2
```

--

```r
mean(as.integer(x)) # it's not 8/4 = 2
```

```
## [1] 1.75
```

---
class: left, hide_logo

### Get to know your factors

```r
str(gapminder$continent)
```

```
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
```

```r
levels(gapminder$continent)
```

```
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
```

```r
nlevels(gapminder$continent)
```

```
## [1] 5
```

---
class: left, hide_logo

### Get to know your factors

```r
glimpse(gapminder)
```

```
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
```

---
class: left, hide_logo

### A few ways to count

.pull-left[

```r
gapminder %>% count(continent)
```

```
## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
```

]

.pull-right[

```r
fct_count(gapminder$continent)
```

```
## # A tibble: 5 × 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24
```

]

---
class: left, hide_logo

### Default order is alphabetical

```r
gapminder$continent %>% levels()
```

```
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
```

--

```r
gapminder$continent %>% fct_infreq() %>% levels()
```

```
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"
```

--

```r
gapminder$continent %>% fct_infreq() %>% fct_rev() %>% levels()
```

```
## [1] "Oceania"  "Americas" "Europe"   "Asia"     "Africa"
```

---
class: left, hide_logo

### Change order of the levels, principled

.panelset[

.panel[.panel-name[Default]

```r
gapminder %>%
  count(continent, name = "count") %>%
  ggplot(aes(y = continent, x = count)) +
  geom_col()
```

<img src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-13-1.png" width="70%" />

]

.panel[.panel-name[By frequency]

```r
gapminder %>%
  mutate(continent = fct_infreq(continent)) %>%
  count(continent, name = "count") %>%
  ggplot(aes(y = continent, x = count)) +
  geom_col()
```

<img src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-14-1.png" width="70%" />

]

.panel[.panel-name[Reverse]

```r
gapminder %>%
  mutate(continent = fct_infreq(continent) %>% fct_rev()) %>%
  count(continent, name = "count") %>%
  ggplot(aes(y = continent, x = count)) +
  geom_col()
```

<img src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-15-1.png" width="70%" />

]

]

---
class: left, hide_logo

### Order by a second variable

.panelset[

.panel[.panel-name[Default]

```r
gap_asia_2007 <- gapminder %>%
  filter(year == 2007, continent == "Asia")

ggplot(gap_asia_2007, aes(x = lifeExp, y = country)) +
  geom_point()
```

<img src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-16-1.png" width="70%" />

]

.panel[.panel-name[By life expectancy]

```r
gap_asia_2007 <- gapminder %>%
  filter(year == 2007, continent == "Asia")

ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()
```

<img src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-17-1.png" width="70%" />

]

]

---
class: left, hide_logo

### When your factor provides the color or fill

.panelset[

.panel[.panel-name[Setup]

```r
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")

h_gap <- gapminder %>%
  filter(country %in% h_countries) %>%
  droplevels()

levels(h_gap$country)
```

```
## [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"
```

]

.panel[.panel-name[Default]

```r
ggplot(h_gap, aes(x = year, y = lifeExp, color = country)) +
  geom_line()
```

<img src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-19-1.png" width="70%" />

]

.panel[.panel-name[`fct_reorder2()`]

```r
ggplot(h_gap, aes(x = year, y = lifeExp,
                  color = fct_reorder2(country, year, lifeExp))) +
  geom_line() +
  labs(color = "country")
```

<img
src="data:image/png;base64,#wrangling6_deck_files/figure-html/unnamed-chunk-20-1.png" width="70%" />

]

]

---
class: left, hide_logo

### Change order of the levels, "because I said so"

```r
h_gap$country %>% levels()
```

```
## [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"
```

```r
h_gap$country %>% fct_relevel("Romania", "Haiti") %>% levels()
```

```
## [1] "Romania"   "Haiti"     "Egypt"     "Thailand"  "Venezuela"
```

---
class: left, hide_logo

### Recode the levels

```r
i_gap <- gapminder %>%
  filter(country %in% c("United States", "Sweden", "Australia")) %>%
  droplevels()

i_gap$country %>% levels()
```

```
## [1] "Australia"     "Sweden"        "United States"
```

```r
i_gap$country %>%
  fct_recode("USA" = "United States", "Oz" = "Australia") %>%
  levels()
```

```
## [1] "Oz"     "Sweden" "USA"
```

---
class: left, hide_logo

### More recoding

```r
gss_cat %>% count(partyid)
```

```
## # A tibble: 10 × 2
##    partyid                n
##    <fct>              <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind,near rep        1791
##  7 Independent         4119
##  8 Ind,near dem        2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490
```

---
class: left, hide_logo

### Clean up the levels with `fct_recode()`

.panelset[

.panel[.panel-name[Code]

```r
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
```

]

.panel[.panel-name[Summary]

```
## # A tibble: 10 × 2
##    partyid                   n
##    <fct>                 <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490
```

]

]

---
class: left, hide_logo

### Do a bunch with
`fct_collapse()`

```r
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep   = c("Strong republican", "Not str republican"),
    ind   = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem   = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
```

```
## # A tibble: 4 × 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180
```

---
class: left, hide_logo

### Lump small groups with `fct_lump()`

.panelset[

.panel[.panel-name[Look at the data]

```r
gss_cat %>% count(relig)
```

```
## # A tibble: 15 × 2
##    relig                       n
##    <fct>                   <int>
##  1 No answer                  93
##  2 Don't know                 15
##  3 Inter-nondenominational   109
##  4 Native american            23
##  5 Christian                 689
##  6 Orthodox-christian         95
##  7 Moslem/islam              104
##  8 Other eastern              32
##  9 Hinduism                   71
## 10 Buddhism                  147
## 11 Other                     224
## 12 None                     3523
## 13 Jewish                    388
## 14 Catholic                 5124
## 15 Protestant              10846
```

]

.panel[.panel-name[Lump]

```r
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 5)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
```

```
## # A tibble: 6 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 None        3523
## 4 Other        913
## 5 Christian    689
## 6 Jewish       388
```

]

]

---
class: left, hide-count

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution-ShareAlike [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)}
* [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund
* [Data Science in a Box](https://datasciencebox.org/)
* STAT 545, [Be the boss of your factors](https://stat545.com/factors-boss.html)
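
---
class: left, hide_logo

### Appendix: getting numbers back out of a factor

A minimal sketch for the factor from "Can be tricky", using base R only. Treating "never 0" as 0 is an assumption made here for illustration, and the names `labels_chr` and `x_num` are hypothetical.

```r
x <- factor(c("never 0", "2", "2", "4"))

# as.integer() returns the level numbers, not the numbers written in the labels
as.integer(x)

# Assumption: "never 0" means 0, so rewrite that label before converting
labels_chr <- as.character(x)
labels_chr[labels_chr == "never 0"] <- "0"

# Convert the labels (not the level numbers) to numeric
x_num <- as.numeric(labels_chr)
mean(x_num)
```

Going through `as.character()` first converts the labels rather than the level numbers.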