The Centers for Medicare & Medicaid Services (CMS) hierarchical condition categories (HCC) model, implemented in 2004, is a risk-adjustment model used to adjust Medicare payments to health care plans for the health expenditure risk of their enrollees. It’s intended use is to pay insurance plans appropriately for their expected relative costs. For example, health plans that care for overwhelmingly healthy populations are paid less than those that care for much sicker populations.

The full model includes variable interactions, demographics (e.g., age, gender), and indicator variables for Medicaid enrollment and disabled status. However, this post will only cover using the HCC model to cluster diagnosis codes into meaningful categories. We’ll implement HCCs for ICD 9 codes using the 2014 model, but once ICD 10 is implemented in my environment, I’ll update the post.

To see the different versions of the model, please visit:
- 2014 Model
- 2015 Model
- ICD 10 Mapping

Classification System

The HCC diagnostic classification system has four components:
- Classify over 14,000 ICD 9 diagnosis codes into 805 diagnostic groups, which represent a well-specified medical condition.
- Diagnosis groups are further aggregated into 189 condition categories, which describe a broader set of diseases. Diseases within a condition category are related clinically and with respect to cost.
- Only use the subset of 189 condition categories that best predict Medicare Part A (inpatient) and Part B (outpatient) medical expenditures. The most recent version (version 22) of the CMS HCC model includes 79 of the disease categories, excluding those that contain diagnoses that are vague/nonspecific (e.g., symptoms), discretionary in medical treatment or coding (e.g., osteoarthritis), not medically significant (e.g., muscle strain), or transitory/definitively treated (e.g., appendicitis). The model also excludes categories that do not empirically add to costs, as well as categories that are fully defined by the presence of procedures or durable medical equipment. This focuses the model on medical problems that are present, rather than services offered.
- Hierarchies are imposed among related condition categories, so that a person is coded for only the most severe manifestation among related diseases. For example, there are four condition categories related to Ischemic Heart Disease; acute myocardial infarction (81), unstable angina & other acute ischemic heart disease (82), angina pectoris (83), and coronary atherosclerosis (84). A patient with an ICD 9 code in condition category 81 is excluded from being coded in categories 82, 83, or 84, even if diagnosis codes in those categories are present.

What you’ll need

  1. Extract a list of patient ICD 9 diagnosis codes from your EHR. The code below assumes this file has 4 columns (Patient_Identifier, Encounter_Identifier, ICD_Diagnosis_Code, and Diagnosis_Date). To identify diagnoses, you’ll have to decide on a window of time to look at. For most cases, we look at two years worth of ICD 9 codes, but you may be interested in smaller or larger time periods. In the code below, this data set is called ‘ICD9_Codes’.
  2. Download the 2014-Final Model and unzip the “CMS-HCC software V2213.79.L2” folder, which contains all the files to implement version 22 of the HCC model.
  3. Set the “hcc” directory to the file path for the “CMS-HCC software V2213.79.L2” folder and the “ehr” directory to the file path containing your ICD 9 diagnosis codes.


Load data.table, lubridate, and stringr:


Step 1: Build HCC label table

The function below converts the text in the SAS file V22H79L1.txt into a table with two columns. The first column is the HCC category and the second column is the HCC category label.

HCC_Labels <- function(hcc){
    ## Set working directory
    ## Load packages
    ## Read lines of SAS file
    labels <- readLines("V22H79L1.TXT")[9:89]
    ## Combine lines that refer to same HCC
    labels <- paste(labels, collapse = " ")
    labels <- unlist(tstrsplit(labels, " HCC"))
    ## Convert to data table
    labels <-

    ## Change column name
    setnames(labels, "labels", "HCC")
    ## Separate each line into two variables
    labels[, c("HCC","HCC_Label") := tstrsplit(HCC, "=", fixed = TRUE)]

    ## Remove white space
    Trim <- function(x){str_trim(x, side = "both")}
    labels <- labels[, lapply(.SD, Trim)]
    ## Remove quotations from label
    labels[, HCC_Label := gsub('"', "", HCC_Label)]
    ## Remove letters from HCC variable
    labels[, HCC := as.integer(gsub("HCC", "", HCC))]
    ## Return label table

Run the function above and assign the output to a table called labels:

labels <- HCC_Labels(hcc)

Step 2: Classify ICD 9 Codes Into Condition Categories

This step uses ICD9 codes extracted from your EHR and the F2213L2P.TXT crosswalk file to group ICD 9 codes into the 79 relevant condition categories included in version 22 of the CMS HCC model.

Assign_CCs <- function(ICD9_Codes, hcc){
    ## Load packages
    ## Read in crosswalk
    crosswalk <-"F2213L2P.TXT"))
    ########## Clean ICD 9 codes
    ## Fix column names of ICD9 codes
    if(names(ICD9_Codes) != c("Patient_Identifier", "Encounter_Identifier", "ICD_Diagnosis_Code", "Diagnosis_Date"))
        setnames(ICD9_Codes, names(ICD9_Codes), c("Patient_Identifier", "Encounter_Identifier", "ICD_Diagnosis_Code", "Diagnosis_Date"))
    ## Fix dates on ICD 9 codes
    if(class(ICD9_Codes$Diagnosis_Date) != "Date")
        ICD9_Codes[, c("Diagnosis_Date", "Time", "AM_PM") := tstrsplit(Diagnosis_Date, " ", fixed = TRUE)][, c("Time", "AM_PM") := NULL][, Diagnosis_Date := as.Date(fast_strptime(Diagnosis_Date, format = "%m/%d/%Y"))]     
    ## Remove periods
    if(sum(grepl(".", ICD9_Codes$ICD_Diagnosis_Code)) != 0)
        ICD9_Codes[, ICD_Diagnosis_Code := gsub("\\.", "", ICD_Diagnosis_Code)]
    ########## Clean crosswalk
    ## Change names
    setnames(crosswalk, names(crosswalk), c("ICD_Diagnosis_Code", "HCC"))
    ########## Merge ICD 9 codes with crosswalk
    ICD9_Codes <- merge(ICD9_Codes, crosswalk, by = "ICD_Diagnosis_Code")

    ## Return ICD 9 codes

The merge above is an inner join that assigns ICD 9 codes to condition categories. Note that many ICD 9 codes you extract from the EHR will be dropped, because they do not correspond to condition categories that are relevant for the model. For example, when I ran the function above on a sample of ~840,000 ICD 9 codes extracted from the EHR, only ~135,000 are assigned to condition categories. This should emphasize that HCCs should only be used to identify the prevalence of chronic conditions that are defined in the model.

Run the function above and pass the result to a table called ICD9s_CCs:

ICD9s_CCs <- Assign_HCCs(ICD9_Codes, hcc)

Step 3: Identify Condition Categories Across Population

Now that the ICD 9 codes are classified by patient and by condition category, you can calculate the presence of the 79 condition categories for each patient:

Collapse_CCs <- function(ICD9s_CCs, labels){
    ## Load data.table
    ## Count diagnosis codes for each patient in each category
    cc_by_patient <- ICD9s_CCs[, .(Freq = .N), by = c("Patient_Identifier", "HCC")]
    ## Only keep categories where patients have more than one diagnosis code
    cc_by_patient <- cc_by_patient[Freq >= 2]
    ## Only keep the columns for patient and HCC
    cc_by_patient <- cc_by_patient[, .(Patient_Identifier, HCC)]
    ########## Convert long and skinny table to long and fat
    ## Identify all the HCCs for which you have to build out dummy variables
    dummies <- as.character(labels$HCC)
    ## Build out dummy variables for each category
    cc_by_patient[,(paste("HCC", dummies, sep = "")):=lapply(dummies,function(x) as.numeric(HCC==as.character(x)))]
    ## Drop HCC variable
    cc_by_patient[, HCC := NULL]
    ####### Collapse data to 1 row per patient
    cc_by_patient <- cc_by_patient[, lapply(.SD, max), by = Patient_Identifier]
    ## Return data for all 79 CCs for each patient

Note that the function above takes labels (the output of step 1) and ICD9s_CCs (the output of step 2) as inputs. In return, the function builds a wide table with one row for each patient and 80 columns. The first column is patient identifier and the following 79 columns are for each of the condition categories. A value of 1 indicates presence of a condition and a value of 0 indicates absence of a condition.

Run the function above and assign the output to Condition_Categories:

Condition_Categories <- Collapse_CCs(ICD9s_CCs, labels)

Step 4: Build a Hierarchy Table

So far, we have been working with condition categories, NOT hierarchical condition categories. In the table built above, Condition_Categories, it’s possible for patients to have “Diabetes without complication” as well as “Diabetes with chronic complications”. We’ll fix this issue by requiring patients to only have the most severe condition within a group of related conditions.

To determine the hierarchical structure of the condition categories, we’ll build a table from the V22H79H1.TXT file. The table will have three columns: Condition Name, HCC, and a list of HCCs to zero if a patient has a given HCC in the second column. For example, a patient with a value of 1 for HCC 8 (“Metastatic Cancer and Acute Leukemia”) has 0 values overwritten for HCC 9 (“Lung and Other Severe Cancers”), HCC 10 (“Lymphoma and Other Cancers”), HCC 11 (“Colorectal, Bladder, and Other Cancers”), and HCC 12 (“Breast, Prostate, and Other Cancers and Tumors”).

HCC_Table <- function(hcc){
    ## Load packages
    ## Load V22H79H1.txt
    Hierarchy <-"V22H79H1.txt")[28:58])
    ## Change name of first column
    setnames(Hierarchy, "V1", "Condition")
    ## Split out condition categories to zero
    Hierarchy[, c("Condition", "HCC", "To_Zero") := tstrsplit(Condition, "%", fixed = TRUE)]
    ## Strip away extra characters in HCC variable
    Hierarchy[, HCC := gsub("=", "HCC", str_extract(HCC, "=[0-9]+"))]
    ## Strip away extra characters in To_Zero variable
    Hierarchy[, To_Zero := gsub("STR|\\)|;| ", "", To_Zero)]
    ## Add HCC before every number To_Zero
    Hierarchy[, To_Zero := gsub("\\(", "HCC", To_Zero)]
    Hierarchy[, To_Zero := gsub(",", ",HCC", To_Zero)]

    ## Strip away extra characters in Condition variable
    Hierarchy[, Condition := str_extract(Condition, "[a-zA-Z]+ *[0-9]")]

    ## Trim away white space from all columns
    Trim <- function(x){str_trim(x, side = "both")}
    Hierarchy <- Hierarchy[, lapply(.SD, Trim)]
    ## Return table

Run the function above and assign the output to Hierarchy_Table:

Hierarchy_Table <- HCC_Table(hcc)

Step 5: Implement Hierarchy

The last step is implementing the hierarchical structure of condition categories (output of step 4) on the “Condition_Categories” table (output of step 3), which shows patient-level diagnoses for each of the 79 condition categories.

Hierarchy <- function(Hierarchy_Table, Condition_Categories){
    ## Get parameters for loops (number of hierarchy conditions and number of patients)
    Hierarchy_Count <- nrow(Hierarchy_Table)
    Patient_Count <- nrow(Condition_Categories)
    ## Loop for each hierarchy condition
    for (i in 1:Hierarchy_Count){
        ## Get the name of the column the hierarchy condition is based on and the vector of columns to zero
        colname <- Hierarchy_Table$HCC[i]
        zero_list <- unlist(strsplit(Hierarchy_Table$To_Zero[i], split=","))
        ## Loop through all the columns to set to 0
        for (k in zero_list){
            set(Condition_Categories, which(Condition_Categories[[colname]] == 1), k, 0)
        ## Show progress through conditions
    ## Return Hierarchical Condition Categories

Run the function above and assign the output to HCCs:

HCCs <- Hierarchy(Hierarchy_Table, Condition_Categories)


And there you have it! For a list of patients, you’ve converted ICD9 codes into 79 chronic diseases.