## Intro

The Clinical Classification Software (CCS) for diagnosis codes is a software tool developed as part of the Healthcare Cost and Utilization Project, a Federal-State-Industry partnership sponsored by the Agency for Healthcare Research and Quality.

In this post, we’ll deploy the CCS categorization scheme to International Classification of Disease (ICD) 9 diagnosis codes to identify diabetes and high blood pressure. This scheme collapses over 14,000 diagnosis codes into roughly 250 categories that are more useful in health data analysis.

To get a sense of what the scheme looks like, check out the single-diagnosis appendix, which maps sets of ICD 9 codes to diagnosis categories.

To see how ICD codes are organized, check out this public reference. Note that it’s a hierarchical scheme, where each additional decimal place conveys increasingly granular information.

## What you’ll need

1. Extract a list of patient ICD 9 diagnosis codes from your EHR, typically along with patient identifiers, dates, and encounter identifiers. To identify diagnoses, you’ll have to decide on a window of time to look at. For most cases, we look at two years worth of ICD 9 codes, but you may be interested in smaller or larger time periods. In the code below, this data set is called ‘data’.
2. Download the Single Level CCS zip file from the CCS website and within that folder, pull out the ‘dxref 2015.csv’ file. In the code below, this data set is called ‘CCS’. ## Packages Load data.table and stringr: library(data.table) library(stringr) ## Clean your ICD9 codes The CCS crosswalk maps ICD9 codes without decimal places (relying on the unique sequence of digits), so you’ll have to remove decimals from your EHR data extract: data[, ICD_Diagnosis_Code := gsub("\\.", "", ICD_Diagnosis_Code)] ## Clean CCS crosswalk You only need the first three columns from the CCS file and must to skip the first line when reading in the table. We’ll also change the names of the data to align with our EHR diagnosis codes: CCS <- fread("~/dxref 2015.csv", skip = 1, select = 1:3)
setnames(CCS, names(CCS), c("ICD_Diagnosis_Code", "CCS_CATEGORY", "CCS_CATEGORY_DESCRIPTION"))

The ICD 9 codes in the crosswalk file include dashes and extra spaces that must be removed:

Remove_Dash <- function(x){gsub("\\'", "", x)}
CCS <- CCS[, lapply(.SD, Remove_Dash)]

Trim <- function(x){str_trim(x, side = "both")}
CCS <- CCS[, lapply(.SD, Trim)]

If you’re not familiar with data.table syntax, note that .SD stands for subset of data and is used to specify the columns to which you want to apply a function. If you don’t specify any columns, the function is applied to all columns.

## Merge ICD9 codes with CCS crosswalk

data <- merge(data, CCS, by = "ICD_Diagnosis_Code", all.x = TRUE)

Now, each ICD 9 code you pulled from the EHR is assigned to a diagnosis category.

## Example: Identifying Diabetes

diabetes <- data[CCS_CATEGORY_DESCRIPTION %in% c("DiabMel no c", "DiabMel w/cm")]
diabetes_patients <- diabetes[, .(Diabetes = .N), by = Patient_Identifier][Diabetes >= 2][, Diabetes := 1]

There’s two steps to the code above:
- Subset ICD 9 codes that are in diabetes CCS categories (diabetes mellitus with complications and without complications)
- For each patient, sum the number of ICD 9 codes in diabetes CCS categories. For patients with more than 2 codes, assign the variable “Diabetes” a value of 1.

## Example: Identifying Hypertension (High Blood Pressure)

hypertension <- data[CCS_CATEGORY_DESCRIPTION %in% c("HTN", "Htn complicn")]
htn_patients <- hypertension[, .(HTN = .N), by = Patient_Identifier][HTN >= 2][, HTN := 1]

There’s two steps to the code above:
- Subset ICD 9 codes that are in hypertension CCS categories (hypertension with complications and without complications)
- For each patient, sum the number of ICD 9 codes in hypertension CCS categories. For patients with more than 2 codes, assign the variable “HTN” a value of 1.

## FAQ:

• Is it really this easy to identify diagnoses? No, this technique has many limitations and there’s an entire field of medical research dedicated to developing computable phenotypes for chronic diseases. However, CCS codes have been used in many publications and are very useful for identifying patterns of disease within a population or tracking the timeline of diagnosis of a disease for an individual patient.
• What are some of the limitations? ICD 9 codes are primarily used as a mechanism for payment and the reliability of the codes is questionable. For example, there are different coding contexts in which codes can be assigned. You may notice that codes assigned by providers at the point of care may be different than codes that make it into the final healthcare bill. I don’t get into those details, but there are algorithms that do take into account the context of the diagnosis code as well as the rank of the code.
• Why a minimum of 2 codes for each patient to assign a diagnosis? Requiring at least two codes addresses some of the limitations described above. If a patient only has 1 code for a diagnosis over two years, we have a high degree of suspicion that the patient hasn’t actually developed the given disease. For example, we’d like to see two different encounters at which providers document the development of complications secondary to diabetes before we change that code. This means there may be a lag in detection of certain conditions, but it’s a balance to achieve more accuracy.
• Why look at 2 years? ICD 9 codes are only assigned at healthcare encounters. Let’s say a patient has high blood pressure, but didn’t see a doctor this year. We wouldn’t detect that patient’s condition unless we used a window of time that was wide enough to capture healthcare encounters.