So, we talked about demographics as the first common data type. The next common data type is diagnosis. Let's go through the specs of this data type. There are multiple places in a health data source where you might find diagnoses. Of course, you can see a lot of diagnostic codes in the encounter table that indicate whether the patient has a disease or not. However, diagnostic codes can also be used for signs and symptoms, injuries, health status factors, family history, the problem list, and so on. So, diagnostic codes might be in multiple tables in a database. Now, just a bit of background. It's good to know that chronic diseases are on the rise in the US, and you might find the same trends in your database as well. There are a number of derived variables that you can compute from the diagnostic codes, such as the severity of a diagnosis, its trajectory, or its history. However, a lot of this requires some clinical knowledge to understand whether and how some of these diagnostic codes can be merged. There are a number of coding standards that are commonly used here in the US and in other countries, like the International Classification of Diseases, or ICD, and the International Classification of Primary Care, or ICPC, which is used more internationally, outside of the US setting. We have the Systematized Nomenclature of Medicine, or SNOMED, which is a more comprehensive standard that captures many different perspectives on a disease. We have DSM, which is the Diagnostic and Statistical Manual of Mental Disorders, and also other coding systems used in other countries, like Read Codes, which have been heavily used in the UK. Again, this list is not exhaustive. So, let's continue with the specs of the diagnostic data type. The common data sources for diagnosis are, of course, insurance claims and EHRs, and we will talk more about these data sources in future videos.
We also have data quality issues. However, because a lot of diagnostic coding is regulated, at least here in the US, and a lot of reimbursement is attached to it, most diagnostic codes are reliable and of good quality. But again, the quality might differ from one data source to another, because data sources might have different purposes, and for the same reason they might have different quality for different data types, including diagnosis. There are also data interoperability issues: translating a diagnostic code from ICD to SNOMED might be difficult, and so might translating a code from one version of a diagnostic coding system to another version. There are legal considerations for very specific diagnostic codes that might be indicative of certain conditions, like HIV or mental illnesses, here in the US. However, overall, there are no other HIPAA concerns about diagnostic codes alone. Here is a sample table, again from a database, where each row belongs to a patient; you can see there is a patient identification number. There is the ICD code version, meaning that this is ICD-9 and not ICD-10, since there are multiple versions of these coding standards. You can see when a visit happened, that's the date there, and then there are nine diagnostic codes next to it. Of course, this is more of a research table. It is not a normalized table, which means it's not scalable. For example, what if a patient has more than nine diagnostic codes in a visit? What should I do then? Should I add a new row, or is there a new column that I can add? That is not possible, at least for these relational databases. So, you can see all the diagnoses for a visit are in one row next to one patient. Again, this is not a normalized table, but still a lot of researchers use these diagnostic tables.
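To make the normalization point concrete, here is a minimal Python sketch that reshapes a wide visit table into one row per diagnosis, so a visit can carry any number of codes. The column names (`patient_id`, `icd_version`, `visit_date`, `dx1` to `dx9`) and the sample rows are assumptions made for illustration; they are not taken from the table on the slide.

```python
# Assumed layout: one row per visit, with up to nine diagnosis columns dx1..dx9.
DX_COLUMNS = [f"dx{i}" for i in range(1, 10)]


def wide_to_long(wide_rows):
    """Yield one record per (visit, diagnosis), skipping empty diagnosis slots."""
    for row in wide_rows:
        for slot, column in enumerate(DX_COLUMNS, start=1):
            code = row.get(column)
            if code is None or str(code).strip() == "":
                continue  # this slot was never filled for the visit
            yield {
                "patient_id": row["patient_id"],
                "icd_version": row["icd_version"],
                "visit_date": row["visit_date"],
                "dx_slot": slot,  # position of the code in the wide table
                "dx_code": str(code).strip(),
            }


# Illustrative rows only; the patients and code combinations are made up.
wide_rows = [
    {"patient_id": "P001", "icd_version": 9, "visit_date": "2012-03-14",
     "dx1": "997.49", "dx2": "E878.8", "dx3": "V45.89"},
    {"patient_id": "P002", "icd_version": 9, "visit_date": "2012-04-02",
     "dx1": "250.00", "dx2": ""},
]

long_rows = list(wide_to_long(wide_rows))
for record in long_rows:
    print(record)
```

In the long layout, a tenth diagnosis is just one more row, so no schema change is needed.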
If you look at arrow one, you can see that the patient also has some diagnostic codes that start with E, which for ICD-9 codes means an external cause of injury. If you look at arrow number two, there is a diagnostic code starting with V, which is a factor influencing health status. So, you need to know how the coding system, ICD in this case, works: the classification, the special notations, and the naming and numbering conventions used to code the different diseases, in order to understand what these tables mean. Of course, if you look at arrow number three, there are instances where some diagnostic code columns were never filled because the patient didn't have that many, and at arrow number four you can see more of them. As an example, you can see one of the ICD codes there, 997.49, and if you look it up in the ICD classification system, you can see it means other digestive system complications. So, I talked about issues with interoperability and mapping codes. Here's an example. ICD-9 was in effect here in the US until October 2015, and then the switch to ICD-10 happened. You may ask, why did we switch to a new classification system? It's because ICD-10 carries more information and helps us describe diseases in finer detail. And of course, sometimes a classification needs to be updated: sometimes diseases move from one category to another, or a completely new category is started. So, you can see that for ICD-9, sometimes it's a one-to-one mapping. That means one ICD-9 code translates to exactly one ICD-10 code, and that is an easy match. But sometimes, as you can see in numbers two, three, and four, one disease in ICD-9 might translate into three, many, or even thousands of different ICD-10 codes. So, it is not as easy as a simple lookup table where you can match different versions of a coding system. That makes it very complex when you work with databases that may contain different versions of a coding system like ICD.
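As a rough sketch of why a simple lookup is not enough, the Python below treats the crosswalk as a mapping from one ICD-9 code to a list of ICD-10 candidates and reports which kind of mapping each code has. The code strings are placeholders, not real ICD codes, and the function names are hypothetical; a real project would load a published crosswalk file instead.

```python
from collections import defaultdict

# Placeholder crosswalk pairs (ICD-9 code, ICD-10 code); not real codes.
CROSSWALK_ROWS = [
    ("ICD9_A", "ICD10_A"),
    ("ICD9_B", "ICD10_B1"),
    ("ICD9_B", "ICD10_B2"),
    ("ICD9_B", "ICD10_B3"),
]


def build_crosswalk(rows):
    """Group crosswalk pairs so each ICD-9 code maps to a list of ICD-10 codes."""
    crosswalk = defaultdict(list)
    for icd9, icd10 in rows:
        if icd10 not in crosswalk[icd9]:
            crosswalk[icd9].append(icd10)
    return dict(crosswalk)


def map_code(icd9, crosswalk):
    """Return (mapping kind, candidate ICD-10 codes) for a single ICD-9 code."""
    targets = crosswalk.get(icd9, [])
    if not targets:
        kind = "unmapped"
    elif len(targets) == 1:
        kind = "one-to-one"
    else:
        kind = "one-to-many"
    return kind, targets


crosswalk = build_crosswalk(CROSSWALK_ROWS)
for code in ["ICD9_A", "ICD9_B", "ICD9_C"]:
    print(code, map_code(code, crosswalk))
```

For a one-to-many result, choosing among the candidates usually needs clinical judgment or extra data from the record, which is the part a lookup table cannot provide.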
So, the last common data type that I'm going to talk about is medications. You can see the specs. Depending on the data source, a medication record might be a prescription, as in the EHR, or a record of whether the medication was actually filled, which is on the claims side. In terms of background, there are always new medications coming in, and there are updates almost on a daily basis to some of the medication coding systems. Overall, the cost of medication is increasing in the US because of the new biologic medications, which are very costly to produce, at least for now. Now, there are a lot of derived variables that you can compute from this type of data, such as medication adherence rates. You can see two examples here, which I will come back to and explain on a later slide. Now, the coding standards. You can see some of them here, like the National Drug Code, or NDC, and RxNorm codes. Again we have SNOMED, especially its chemical axis. We have ATC, which stands for the Anatomical Therapeutic Chemical Classification System, and we have some that are more or less commercial coding systems, like Medi-Span, Multum, the Generic Product Identifier, which is GPI, and also First Databank. Now, data sources. As I just mentioned, you can look into insurance claims and EHRs, but there are also other, very specific data sources for medications, and here we are talking about patient-level, large-scale data sources. You can see three examples here. There are pharmacy benefit management, or PBM, databases, which sit in the middle between an EHR and an insurance company. There is the Surescripts network here in the US, which manages most of the e-prescriptions: when a physician puts an order into an EHR, it delivers that order digitally to your pharmacy of choice.
So, that's Surescripts, and they also collect a lot of data on medications that are prescribed and dispensed. We also have something in the US called PDMPs, prescription drug monitoring programs, where a lot of states have a central registry. These are usually for drugs that should be monitored because there might be addiction involved, such as controlled substances. This way, they can make sure that there's no abuse in the system, and they can track patients who might be taking these medications for a long time. Now, there are, as usual, some data quality issues. However, the data quality of medications is still good because of the industry around it and all of the coding and reimbursement associated with it. As usual, there are some data interoperability issues, especially when mapping from one coding system to another, and there might also be some legal considerations depending on what medication the patient is taking. Here's a sample table. You can see again that each individual row is a medication given to a patient at a specific visit. Here, arrow number one is listing the product names. You can see, for example, if you scroll down, that there are two medications listed with the same name, but if you go to the right side of the table, you can see that they come in two different packages. That is basically what an NDC code does: it identifies not only what ingredient the medication is made of, but also what packaging it comes in. That's why NDC codes are heavily used in the claims world, because it's all about how many units are in that package, and that's how you pay for it. So, arrow number two shows the package description. Mapping codes gets very complex when it comes to medications, because each of these coding standards is designed to look at a medication from a very specific perspective. It might be the ingredients that are important in one coding system.
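To illustrate the NDC point about product versus package, here is a small Python sketch that splits a hyphenated 10-digit NDC into its three segments (labeler, product, package) and checks whether two codes refer to the same product in different packages. The NDC values are made up for illustration, and the helper names are hypothetical.

```python
from typing import NamedTuple

# Segment-length layouts used for hyphenated 10-digit NDCs
# (labeler-product-package): 4-4-2, 5-3-2, and 5-4-1.
VALID_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}


class Ndc(NamedTuple):
    labeler: str  # firm that makes or distributes the drug
    product: str  # drug, strength, and dosage form
    package: str  # package size and type


def parse_ndc(ndc: str) -> Ndc:
    """Split a hyphenated NDC into its three segments, validating the layout."""
    parts = ndc.strip().split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a hyphenated NDC: {ndc!r}")
    if tuple(len(part) for part in parts) not in VALID_LAYOUTS:
        raise ValueError(f"unexpected NDC segment lengths: {ndc!r}")
    return Ndc(*parts)


def same_product(ndc_a: str, ndc_b: str) -> bool:
    """True when two NDCs share labeler and product, whatever the package."""
    a, b = parse_ndc(ndc_a), parse_ndc(ndc_b)
    return (a.labeler, a.product) == (b.labeler, b.product)


# Made-up NDCs: same labeler and product, two different packages.
print(parse_ndc("12345-678-01"))
print(same_product("12345-678-01", "12345-678-30"))
```

Comparing only the labeler and product segments is one simple way to group rows like the two same-name medications in the sample table.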
I just told you about NDC and why packaging is important there. There can be very different ways to look at a medication, and that's why mapping codes across different medication coding systems is usually very complex. The last item is the type of data that you can derive from medications. Here are three examples. If you look at the top row, a clinician sees a patient at the point of care and writes a prescription. Based on only the prescription data in the electronic health record system, or EHR, you can calculate an index called MRCI, and you can actually do this all at the SQL level if needed. MRCI is the medication regimen complexity index, which measures how hard it is to adhere to a prescribed regimen: sometimes there are too many prescriptions, too many doses, or difficult routes of administration, and that makes it harder for a patient to adhere to the regimen. In the second row, you can see the pharmacist filling a medication, and that usually shows up on the claims side, because that's how the medication gets paid for. Based on that data, you can calculate something called the medication possession ratio, which reflects whether the patient filled the medication or not and how many days in a year they had their medication on hand. The last item, in the last row, is patient medication adherence. It might be recorded in the EHR, although the data quality is usually sketchy these days. That is the med rec, the medication reconciliation record, where a nurse talks to a patient and writes down whether the patient is really taking the medication that was prescribed. So, all of these are derived measures that you can calculate for medications. Now in summary, in this video, we talked about the common data types and the different specs that we review for each of them.
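Going back to the medication possession ratio mentioned above, here is a minimal Python sketch under one common simplified definition, assumed here: total days' supply dispensed in a period divided by the number of days in that period. The fill records are made up, the function name is hypothetical, and capping the ratio at 1.0 is a design choice rather than part of the lecture.

```python
from datetime import date


def medication_possession_ratio(fills, period_start, period_end, cap=True):
    """Days' supply dispensed within the period divided by days in the period.

    fills: iterable of (fill_date, days_supply) pairs from pharmacy claims.
    cap:   when True, limit the ratio to 1.0 (early refills can push it higher).
    """
    period_days = (period_end - period_start).days + 1  # inclusive of both ends
    if period_days <= 0:
        raise ValueError("period_end must not be before period_start")
    supplied = sum(
        days_supply
        for fill_date, days_supply in fills
        if period_start <= fill_date <= period_end
    )
    ratio = supplied / period_days
    return min(ratio, 1.0) if cap else ratio


# Made-up claims for one patient and one medication during 2023.
fills = [
    (date(2023, 1, 1), 30),
    (date(2023, 2, 5), 30),
    (date(2023, 4, 1), 90),
]
mpr = medication_possession_ratio(fills, date(2023, 1, 1), date(2023, 12, 31))
print(round(mpr, 3))
```

This version ignores overlapping fills and supply that runs past the end of the period, so treat it as a starting point rather than a validated adherence measure.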
More specifically, we talked about demographic data, namely age, sex, and gender; we talked about diagnosis; and we talked about medications, both prescribed and dispensed. In a future video, we'll talk about procedures, surveys, and utilization. Thank you.