class: titleSlide, hide_logo # Data Wrangling ## Dates and Times <br> <center><img src="data:image/png;base64,#logo.png" width="200px"/></center> --- class: left, hide_logo ## Dates are just hard <blockquote class="twitter-tweet"><p lang="en" dir="ltr">excel: is that a date?<br><br>me: 57.39 is very much not a date<br><br>excel: strong date vibes to me<br><br>me: h-how<br><br>excel: fixed it<br><br>me: 57/39/2020?<br><br>excel: you’re welcome</p>— slate (@PleaseBeGneiss) <a href="https://twitter.com/PleaseBeGneiss/status/1328735477923336192?ref_src=twsrc%5Etfw">November 17, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> --- class: left, hide_logo ### Lubridate makes working with dates easier <br> <center><img src="data:image/png;base64,#https://uc-r.github.io/Intro-R/images/lubridate-large.png" width="200px"/></center> --- class: left, hide_logo ### Here's today's date, no time included .pull-left[ **Base R** ```r Sys.Date() ``` ``` ## [1] "2022-02-15" ``` ```r class(Sys.Date()) ``` ``` ## [1] "Date" ``` ] .pull-right[ **Lubridate** ```r today() ``` ``` ## [1] "2022-02-15" ``` ```r class(today()) ``` ``` ## [1] "Date" ``` ] --- class: left, hide_logo ### Here's the current moment (date+time) R calls timestamps POSIXct, but tibbles will show `<dttm>` .pull-left[ **Base R** ```r Sys.time() ``` ``` ## [1] "2022-02-15 12:12:31 EST" ``` ```r class(Sys.time()) ``` ``` ## [1] "POSIXct" "POSIXt" ``` ] .pull-right[ **Lubridate** ```r now() ``` ``` ## [1] "2022-02-15 12:12:31 EST" ``` ```r class(now()) ``` ``` ## [1] "POSIXct" "POSIXt" ``` ] --- class: left, hide_logo ### String dates need to be parsed Lubridate makes this easy .pull-left[ ```r x <- "January 31st, 2017" class(x) ``` ``` ## [1] "character" ``` ```r # x + 1 # fails at string + number ``` ] .pull-right[ ```r x <- mdy("January 31st, 2017") class(x) ``` ``` ## [1] "Date" ``` ```r x + 1 ``` ``` ## [1] "2017-02-01" ``` ```r # date plus 1 day works ``` ] --- class: left, hide_logo ### Building dates/times from components ```r df <- tribble( ~year, ~month, ~day, 2022, 10, 10, 2021, 8, 1, 2020, 1, 31 ) df %>% mutate(myDate = make_date(year, month, day)) ``` ``` ## # A tibble: 3 × 4 ## year month day myDate ## <dbl> <dbl> <dbl> <date> ## 1 2022 10 10 2022-10-10 ## 2 2021 8 1 2021-08-01 ## 3 2020 1 31 2020-01-31 ``` --- class: left, hide_logo ### Extracting components from dates/times ```r df %>% mutate(myDate = make_date(year, month, day), month_label = month(myDate, label = TRUE)) ``` ``` ## # A tibble: 3 × 5 ## year month day myDate month_label ## <dbl> <dbl> <dbl> <date> <ord> ## 1 2022 10 10 2022-10-10 Oct ## 2 2021 8 1 2021-08-01 Aug ## 3 2020 1 31 2020-01-31 Jan ``` --- class: left, hide_logo ### Rounding .panelset[ .panel[.panel-name[Get Some Data] .pull-left[ ```r set.seed(8675309) onboarding <- tibble( date_onboarded = sample(seq(ymd('2021-01-01'), ymd('2021-12-31'), by="day"), 3000, replace = TRUE), channel = sample(c("WhatsApp", "Messenger"), 3000, replace=TRUE) ) ``` ] .pull-right[ ``` ## # A tibble: 3,000 × 2 ## date_onboarded channel ## <date> <chr> ## 1 2021-07-31 Messenger ## 2 2021-04-17 WhatsApp ## 3 2021-09-24 Messenger ## 4 2021-07-11 Messenger ## 5 2021-03-18 WhatsApp ## 6 2021-05-27 WhatsApp ## 7 2021-06-18 WhatsApp ## 8 2021-11-30 WhatsApp ## 9 2021-05-31 WhatsApp ## 10 2021-12-07 WhatsApp ## # … with 2,990 more rows ``` ] ] .panel[.panel-name[Round Data] .pull-left[ ```r onboarding %>% count(channel, week = floor_date(date_onboarded, "week") ) ``` ] .pull-right[ ``` ## # A tibble: 106 × 3 ## channel week n ## <chr> <date> <int> ## 1 Messenger 2020-12-27 12 ## 2 Messenger 2021-01-03 24 ## 3 Messenger 2021-01-10 25 ## 4 Messenger 2021-01-17 29 ## 5 Messenger 2021-01-24 37 ## 6 Messenger 2021-01-31 25 ## 7 Messenger 2021-02-07 39 ## 8 Messenger 2021-02-14 21 ## 9 Messenger 2021-02-21 23 ## 10 Messenger 2021-02-28 25 ## # … with 96 more rows ``` ] ] .panel[.panel-name[Round Plot] .pull-left[ ```r onboarding %>% count(channel, week = floor_date(date_onboarded, "week") ) %>% ggplot(aes(week, n, color=channel)) + geom_step() + theme_bw() ``` ] .pull-right[ <img src="data:image/png;base64,#wrangling7_deck_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> ] ] ] --- class: left, hide_logo ### Time spans .panelset[ .panel[.panel-name[Get Some Data] .pull-left[ ```r engagement <- tibble( id = rep(sample(1:3000, 1000, replace=FALSE), 3), date_engage = sample(seq(ymd('2021-01-01'), ymd('2021-12-31'), by="day"), 3000, replace = TRUE)) %>% sample_n(1000, replace=FALSE) %>% arrange(id, date_engage) %>% group_by(id) %>% mutate(engage_id = 1:n()) %>% ungroup() ``` ] .pull-right[ ``` ## # A tibble: 1,000 × 3 ## id date_engage engage_id ## <int> <date> <int> ## 1 8 2021-01-07 1 ## 2 8 2021-06-14 2 ## 3 11 2021-02-25 1 ## 4 22 2021-02-01 1 ## 5 24 2021-08-25 1 ## 6 37 2021-05-11 1 ## 7 37 2021-08-06 2 ## 8 37 2021-09-05 3 ## 9 44 2021-01-13 1 ## 10 44 2021-09-10 2 ## # … with 990 more rows ``` ] ] .panel[.panel-name[Think] How can we calculate duration from first to last engagement? ``` ## # A tibble: 1,000 × 3 ## id date_engage engage_id ## <int> <date> <int> ## 1 8 2021-01-07 1 ## 2 8 2021-06-14 2 ## 3 11 2021-02-25 1 ## 4 22 2021-02-01 1 ## 5 24 2021-08-25 1 ## 6 37 2021-05-11 1 ## 7 37 2021-08-06 2 ## 8 37 2021-09-05 3 ## 9 44 2021-01-13 1 ## 10 44 2021-09-10 2 ## # … with 990 more rows ``` ] .panel[.panel-name[Answer Step 1] .pull-left[ ```r engagement %>% group_by(id) %>% #filter(row_number()==1 | # row_number()==n()) filter(engage_id==1 | engage_id==n()) ``` ] .pull-right[ ``` ## # A tibble: 964 × 3 ## # Groups: id [703] ## id date_engage engage_id ## <int> <date> <int> ## 1 8 2021-01-07 1 ## 2 8 2021-06-14 2 ## 3 11 2021-02-25 1 ## 4 22 2021-02-01 1 ## 5 24 2021-08-25 1 ## 6 37 2021-05-11 1 ## 7 37 2021-09-05 3 ## 8 44 2021-01-13 1 ## 9 44 2021-09-10 2 ## 10 48 2021-02-12 1 ## # … with 954 more rows ``` ] ] .panel[.panel-name[Answer Step 2] .pull-left[ ```r engagement %>% group_by(id) %>% filter(engage_id==1 | engage_id==n()) %>% mutate(diff = date_engage - lag(date_engage)) %>% arrange(id, desc(engage_id)) %>% distinct(id, .keep_all = TRUE) %>% select(-engage_id) ``` ] .pull-right[ ``` ## # A tibble: 703 × 3 ## # Groups: id [703] ## id date_engage diff ## <int> <date> <drtn> ## 1 8 2021-06-14 158 days ## 2 11 2021-02-25 NA days ## 3 22 2021-02-01 NA days ## 4 24 2021-08-25 NA days ## 5 37 2021-09-05 117 days ## 6 44 2021-09-10 240 days ## 7 48 2021-03-16 32 days ## 8 49 2021-02-09 NA days ## 9 51 2021-09-01 18 days ## 10 52 2021-04-13 NA days ## # … with 693 more rows ``` ] ] ] --- class: left, hide_logo ### Time zones ```r Sys.timezone() ``` ``` ## [1] "America/New_York" ``` ```r head(OlsonNames()) ``` ``` ## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa" ## [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera" ``` --- class: left, hide_logo ### Time zones This is an instant in time. ```r ymd_hms("2015-06-01 12:00:00", tz = "America/New_York") ``` ``` ## [1] "2015-06-01 12:00:00 EDT" ``` -- This is the same instant in time. ```r ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen") ``` ``` ## [1] "2015-06-01 18:00:00 CEST" ``` --- class: left, hide_logo ### Watch for dropped time zone info Some operations strip time zone information and display in your local timezone: ```r x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York") x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen") x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland") x4 <- c(x1, x2, x3) x4 ``` ``` ## [1] "2015-06-01 12:00:00 EDT" "2015-06-01 12:00:00 EDT" ## [3] "2015-06-01 12:00:00 EDT" ``` --- class: left, hide_logo ### Change display vs change instant When you want to keep the instant the same but and change the display: ```r x4 ``` ``` ## [1] "2015-06-01 12:00:00 EDT" "2015-06-01 12:00:00 EDT" ## [3] "2015-06-01 12:00:00 EDT" ``` ```r with_tz(x4, tzone = "Australia/Lord_Howe") ``` ``` ## [1] "2015-06-02 02:30:00 +1030" "2015-06-02 02:30:00 +1030" ## [3] "2015-06-02 02:30:00 +1030" ``` ```r with_tz(x4, tzone = "Australia/Lord_Howe") - x4 ``` ``` ## Time differences in secs ## [1] 0 0 0 ``` --- class: left, hide_logo ### Change display vs change instant When you want to change the instant: ```r x4 ``` ``` ## [1] "2015-06-01 12:00:00 EDT" "2015-06-01 12:00:00 EDT" ## [3] "2015-06-01 12:00:00 EDT" ``` ```r force_tz(x4, tzone = "Africa/Nairobi") ``` ``` ## [1] "2015-06-01 12:00:00 EAT" "2015-06-01 12:00:00 EAT" ## [3] "2015-06-01 12:00:00 EAT" ``` ```r force_tz(x4, tzone = "Africa/Nairobi") - x4 ``` ``` ## Time differences in hours ## [1] -7 -7 -7 ``` --- class: left, hide-count # Credits Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) * {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)} * [R for Data Science](https://r4ds.had.co.nz/index.html), by Wickham and Grolemund