Welcome back, everyone. In our previous lesson, we looked at the definition of risk stratification and some of the essential concepts to bear in mind when preparing to conduct it. Now we will turn our attention from the what to the how. In this lesson, I will describe the steps that are necessary to perform risk stratification. These steps come from an article called "An Introduction to Predictive Modeling for Disease Management Risk Stratification" by Michael Cousins, Lisa Shickle and John Bander. Although this is a great article, a few of the steps are not discussed in enough detail. For example, there needs to be a first step that reminds analysts about the importance of getting to know their data and looking for appropriate outcomes. Next, it is necessary to pre-process data and check for data quality problems. This step might also involve creating derived fields from combinations of other fields and records. I also think that the article needs to remind readers more about the importance of using groupers. Yet all that said, this article nicely describes many of the critical steps required to perform risk stratification. So, let's review what Cousins and his colleagues advise. We will have to fill in some of the details as we go.
At the end of the lesson, you will be able to list and articulate the meaning of five important initial steps when performing risk stratification: build your own or buy stratification software, select a target variable, consider groupers, specify time periods, and evaluate candidate predictor variables.
Let's begin with that first step. In general, most health organizations have their own data, yet many do not have algorithms to stratify members or patients by risk. Thus in step one, analysts should ask, "Should our organization purchase a risk stratification system that is sold with specific groupers, or should our organization use an open-source grouper system and create our own models?"
The answer to this question will depend on financial resources and on the availability of analysts who have the time and skills to create homegrown models. In addition, it is important to consider the degree to which commercial options satisfy the business or research needs.
Let's assume that you are part of a team that has been tasked with building algorithms. That's when you would apply step two by selecting a target variable. As you know from your data science and statistical work, predictive modeling, especially supervised modeling, requires a target variable. As discussed earlier, analytics should be driven by specific objectives and specific problems that need to be solved. Thus the target variable should measure what is important to stratify. For example, if you want to identify which patients are likely to be associated with the highest cost, your target variable should be related to cost outcomes. Outcomes or targets can be related to all aspects of health services, such as access, cost or quality.
Moving on to step three, you should consider whether your stratification models require groupers. As discussed in our earlier lesson on groupers, it is impractical to model hundreds or thousands of raw fields. You can imagine how much more complex tree models get with hundreds of variables, or how difficult it would be to interpret a regression model with hundreds of fields. Thus as we learned earlier, groupers make health care data manageable for analysis. In addition, aggregation of codes into categories also leads to more reliable models. This is because small and unstable numbers of observations get aggregated into groups with more stable statistical properties. Okay. So, I've reminded you that groupers could be useful in your stratification efforts, yet you need to remember that the choice of grouper can have a big impact on the stratification results.
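As a small illustration of how groupers roll codes up into categories, and of how two groupers can disagree, here is a minimal Python sketch. The patient records and both mappings (GROUPER_A and GROUPER_B) are invented for this example and are not the real HCC or ACG logic; only the ICD-10 code strings are real.

```python
from collections import Counter

# Toy diagnosis records: (patient_id, ICD-10 code). Invented for illustration.
records = [
    ("p1", "E11.9"),
    ("p2", "E11.65"),
    ("p3", "I10"),
    ("p4", "I50.9"),
    ("p5", "J44.9"),
    ("p6", "E11.9"),
]

# Two hypothetical groupers. These mappings are NOT the real HCC or ACG logic;
# they only show that the same codes can be rolled up in different ways.
GROUPER_A = {
    "E11.9": "diabetes",
    "E11.65": "diabetes",
    "I10": "cardiovascular",
    "I50.9": "cardiovascular",
    "J44.9": "respiratory",
}
GROUPER_B = {
    "E11.9": "diabetes_uncomplicated",
    "E11.65": "diabetes_complicated",
    "I10": "hypertension",
    "I50.9": "heart_failure",
    "J44.9": "chronic_lung",
}


def category_counts(records, grouper):
    """Count distinct patients in each grouper category."""
    seen = set()
    for patient_id, code in records:
        # Codes the grouper does not know about fall into a catch-all bucket.
        category = grouper.get(code, "ungrouped")
        seen.add((patient_id, category))
    return Counter(category for _, category in seen)


counts_a = category_counts(records, GROUPER_A)
counts_b = category_counts(records, GROUPER_B)
print("Grouper A:", dict(counts_a))
print("Grouper B:", dict(counts_b))
```

In this toy data, the coarser grouper produces three categories with more patients in each, while the finer grouper produces five categories, most of them with a single patient. That is the trade-off between stable counts and clinical detail that the lesson describes.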
For example, the Hierarchical Condition Category, or HCC, grouper categorizes ICD codes differently than the Adjusted Clinical Groups, or ACG, grouper. Why do different groupers lead to different stratification results? Well, one reason is that groupers differ with respect to how different data fields and values are aggregated to form categories. In addition, sometimes grouping is driven by data quality issues (the noise), and sometimes by real associations with health and health care patterns (the signal). Different groupings of data could have big impacts on the final model. Overall, some grouper categories might be better or worse at predicting specific outcomes. This means that even the best and most expensive groupers might not be perfect for all situations. Similarly, a free and open-source grouper might be very good for a specific outcome.
Okay. You know you should consider groupers, and you know that different groupers will lead to better or worse stratification results. Thus you need to think about how to evaluate the different groupers. So, which specific grouper should we choose for our models? The best advice is to remember that groupers were often created for specific purposes. Thus as discussed, grouper effectiveness will depend on the specific outcomes and the data being used. It might help to make choices about your groupers by first studying the published papers and documentation related to them. It might be that a grouper was developed for a specific health outcome, and that its documentation will show how and why it works. Regardless of documentation, there are often complex patterns within health care data sets, so it might be necessary to experiment. You should try running several different models that use different groupers, and then evaluate the results. Of course, you always need to be wary that overfitting the data could be a problem.
Step four, specifying time periods, might be the most important step.
If the objective is to predict costs of diabetic patients six months in the future, then analysts should select an outcome or target variable that is measured six months in the future. The time series aspect of the data can be used to structure models that really are predicting targets in the future. Of course, by future events here, we mean future relative to a reference point in the past. In other words, by future time periods, I refer to a reference period and not necessarily a future time from the current calendar date. Overall, careful attention to time periods can reduce biases within risk stratification models. As discussed earlier, if an analyst models different time periods to predict future costs, they may reduce problems associated with regression to the mean. In other words, if you have data for multiple years, you can model which patients remain sick through time.
Let me clarify the important aspects of specifying time periods in risk stratification models by saying more about the temporal aspects of data. This table illustrates that we might want to predict an outcome such as costs for the year 2016. In this example, that time period is in the future relative to the reference point of the predictor data, which here is 2015. Assuming it is reasonable to use past data to estimate future costs associated with the outcomes, your task is to pick the specific time frames associated with the data. In this example, 2014 and 2015 data are used, and are assumed to be similar enough to the 2016 data that these years can be used to estimate costs. This assumption might not always be correct, so as an analyst you need to think about whether the past data really can be used to understand the future.
Step five is to select candidate predictors. Selecting candidate predictors, or independent variables, for the model is a critical step. The objective is to select predictors without overfitting to the specific dataset.
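Before turning to specific predictors, it may help to see how the time periods from step four shape the data that predictors are drawn from. The sketch below is a minimal illustration in Python, assuming a made-up claims list with hypothetical fields (patient_id, service_year, paid_amount). Predictor values come only from 2014 and 2015, and the target comes only from 2016, so no information from the target period leaks into the predictors.

```python
from collections import defaultdict

# Toy claims: (patient_id, service_year, paid_amount). Invented for illustration.
claims = [
    ("p1", 2014, 1200.0),
    ("p1", 2015, 1500.0),
    ("p1", 2016, 4000.0),
    ("p2", 2014, 300.0),
    ("p2", 2016, 250.0),
    ("p3", 2015, 9000.0),
    ("p3", 2016, 8000.0),
]

PREDICTOR_YEARS = {2014, 2015}  # lookback window, ending at the reference point
TARGET_YEAR = 2016              # "future" relative to that reference point


def build_modeling_rows(claims):
    """Return {patient_id: {"prior_cost": ..., "target_cost": ...}}."""
    rows = defaultdict(lambda: {"prior_cost": 0.0, "target_cost": 0.0})
    for patient_id, year, amount in claims:
        if year in PREDICTOR_YEARS:
            rows[patient_id]["prior_cost"] += amount
        elif year == TARGET_YEAR:
            rows[patient_id]["target_cost"] += amount
        # Claims outside both windows are ignored on purpose.
    return dict(rows)


rows = build_modeling_rows(claims)
for patient_id in sorted(rows):
    print(patient_id, rows[patient_id])
```

Keeping the two windows as named constants makes the assumption explicit: if 2014 and 2015 stop resembling the target year, the windows are the first thing to revisit.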
Examples of variables that you might include in a model are age, gender and HCC grouper categories. Of course, you might want to evaluate dozens of the HCC grouper categories, but it might be problematic to include all of them in the final model. Once candidate predictors are selected, the modeling process can begin. But we will get to that important step in the next lesson. Okay. I hope that this has been a good start to our description of how risk stratification is performed. We will complete this discussion in the next lesson. I will see you soon.