Data stored in a healthcare record is often organized by patient, visit, and time. When trying to analyze trends for individual patients or populations, it’s important to quickly manipulate dates. Dates come out of our EHR as a combination of Date, Time, and AM/PM, for example “01/01/2001 12:31:50 AM”. This post is about quickly working with this data in R.
Let’s start by loading the two necessary R packages, data.table and lubridate (note that I use data.table version 1.9.5, can be downloaded by following these instructions):
library(data.table)
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:data.table':
##
## hour, mday, month, quarter, wday, week, yday, year
Next, we’ll define a function that generates a set of dates:
Generate_Dates <- function(size){
require(data.table)
## Generate Date
data <- data.table(mo = as.integer(runif(size, min = 1, max = 12)), day = as.integer(runif(size, min = 1, max = 30)), yr = as.integer(runif(size, min = 1990, max = 2014)))
data[, Date := paste(mo, day, yr, sep = "/")]
data[, c("mo", "day", "yr") := NULL]
## Generate Time
data <- data[, c("hr", "min", "sec") := .(as.integer(runif(size, min = 1, max = 12)), min = as.integer(runif(size, min = 1, max = 60)), sec = as.integer(runif(size, min = 1, max = 60)))]
data[, Time := paste(hr, min, sec, sep = ":")]
data[, c("hr", "min", "sec") := NULL]
## Generate AM/PM
data[, AM_PM := sample(c("AM", "PM"), size, replace = TRUE)]
## Combine
data[, Date := paste(Date, Time, AM_PM, sep = " ")]
data[, c("Time", "AM_PM") := NULL]
## Return dates
data
}
To demonstrate the functions, we’ll use a set of 1,000 dates:
data <- Generate_Dates(1000)
head(data)
## Date
## 1: 6/29/2001 6:39:50 PM
## 2: 10/20/2008 7:51:47 AM
## 3: 10/28/2007 7:54:19 AM
## 4: 7/26/2012 11:25:47 PM
## 5: 7/20/2010 3:39:57 PM
## 6: 2/8/2000 10:11:59 AM
To extract date from the value above (“01/01/2001 12:31:50 AM”), you need to:
1) Pull out the first set of characters from the string
2) Convert that first set of characters from a string to a Date type
For the first step, use tstrsplit() from the data.table package. This function will split the single string column into three string columns, if you separate by a space (" “). I then delete the second and third columns, because we’re only interested in the date.
data[, c("Date", "Time", "AM_PM") := tstrsplit(Date, " ", fixed = TRUE)][, c("Time", "AM_PM") := NULL]
For the second step, use fast_strptime() from the lubridate package.
data[, Date := as.Date(fast_strptime(Date, format = "%m/%d/%Y"))]
All together, to fix 1,000 dates:
data[, c("Date", "Time", "AM_PM") := tstrsplit(Date, " ", fixed = TRUE)][, c("Time", "AM_PM") := NULL][, Date := as.Date(fast_strptime(Date, format = "%m/%d/%Y"))]
data <- Generate_Dates(100000)
system.time(data[, c("Date", "Time", "AM_PM") := tstrsplit(Date, " ", fixed = TRUE)][, c("Time", "AM_PM") := NULL][, Date := as.Date(fast_strptime(Date, format = "%m/%d/%Y"))])
## user system elapsed
## 0.186 0.006 0.192
data <- Generate_Dates(1000000)
system.time(data[, c("Date", "Time", "AM_PM") := tstrsplit(Date, " ", fixed = TRUE)][, c("Time", "AM_PM") := NULL][, Date := as.Date(fast_strptime(Date, format = "%m/%d/%Y"))])
## user system elapsed
## 3.896 0.112 4.047
My code is running on a 13-inch MacBook Pro with 8 GB RAM.