Hi, I'm Hadi Kharrazi. I'm a faculty member in the School of Public Health and the School of Medicine at Johns Hopkins University. In this video, you will learn about common data types, more specifically demographics, diagnosis, and medications. So, what is the key takeaway from this video? You will learn how to identify the various data types that are used for clinical research, both traditional and nontraditional. Traditional means the common data types; nontraditional means the emerging ones. It's good to know these because if you have a big database with all of these data types, and you're trying to run a query to find a trend, a denominator of patients, or some other information, you need to know what data types you're dealing with.
So, what are the common or traditional data types? Here is a list. It is not exhaustive, but it covers most of what you will see out there: demographics, diagnosis, medications, procedures, surveys, and utilization factors. We will discuss the last three in a different video, and the first three, demographics, diagnosis, and medications, in this video.
As I go through each of these data types, I will cover the same set of specs. I will list the common variables in that data type group. I will give you a bit of background: what the general trend of the variable is and what it means at a population level. Then I will talk about derived variables, that is, variables that you can derive from the original ones. There are different methods for extracting derived variables from these data types for different research questions. For example, you might extract prescription patterns or adherence from medication data. Then I'll talk about the coding standards used for these data types.
So, you can see a bunch of coding standards here, like the International Classification of Diseases or ICD, the Systematized Nomenclature of Medicine or SNOMED, RxNorm, which is for medications, and DSM, NDC, and ATC, which I will describe throughout the lecture. The other specs that I will discuss include which data sources commonly contain each data type, such as insurance claims, EHRs, health information exchanges, and so on. I'll talk about data quality issues that might be associated with those data types, and also data interoperability issues, that is, how those data types might be affected when larger databases cannot be collected, integrated, shared, or linked with each other. Finally, for each of these data types I will also give you hints about the legal considerations: whether there are privacy and security issues that may impact accessing or using these data types, more specifically under the Health Insurance Portability and Accountability Act, or HIPAA.
So, for demographics, there are a number of variables that you would see in various databases: age, sex, gender, race, and ethnicity. Demographics have also been expanded to include other things, so sometimes you will even see educational level, income level, and so on. However, those are not common variables at this time. In terms of background, life expectancy in the US is increasing, so that's a trend for age. For the rest, there is no very specific trend per se. In terms of derived variables, some of these might be coded in a specific way. For example, age might not be in years or months; it might be in groups or ranges, or sometimes you may infer age from some other criterion, like somebody having Medicare insurance, which might imply that they are 65 or older. Now, there are no coding standards per se, but age is not always stored as a date of birth.
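The derived age variables just described can be sketched in a few lines of Python. This is only an illustration, not how any particular database does it: the function names, the ten-year band width, and the labels are all assumptions made for the example, and the Medicare rule is a rough heuristic, since having Medicare only suggests that a person is 65 or older.

```python
from datetime import date
from typing import Optional


def age_in_years(date_of_birth: date, as_of: date) -> int:
    """Completed years between date_of_birth and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def age_band(age: int, width: int = 10) -> str:
    """Group an age into a fixed-width range, e.g. 69 -> '60-69' (width is an assumption)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def infer_age_group(age: Optional[int], has_medicare: bool) -> str:
    """Use the recorded age when present; otherwise fall back to a Medicare-based guess."""
    if age is not None:
        return "65+" if age >= 65 else "under 65"
    if has_medicare:
        # Heuristic only: Medicare coverage might imply 65 or older, but not always.
        return "65+ (inferred)"
    return "unknown"


if __name__ == "__main__":
    age = age_in_years(date(1950, 6, 15), as_of=date(2020, 6, 14))
    print(age, age_band(age), infer_age_group(age, has_medicare=True))
    print(infer_age_group(None, has_medicare=True))
```

Keeping the inferred label separate from the recorded one makes it easy to exclude inferred ages later if a study needs exact values.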
A lot of times it is in years or other formats, especially because of HIPAA, which I will explain. Sex or gender might be coded differently in different databases, but it's often zero, one, or two, with zero standing for unknown and one and two standing for female or male. Sometimes there are other digits, or no value at all, or something else to indicate that it's missing. Almost all clinical data sources, if they are patient-level, will have age, sex, and other demographics as needed. So, if you're working with insurance claims data, EHR data, health information exchange data, and others, and they are patient-level, you will usually see those data being stored.
Now, in terms of data quality, the quality is usually good for this information because there are various mandates requiring that it be collected fairly well. However, there might be issues with data completeness. Sometimes age is missing, sometimes sex is missing, and sometimes age is not as granular as you want: it's only the year, while you also want to know the months or days.
Now, in terms of data interoperability, there are no known interoperability issues. The concepts of age, gender, and other demographics are well defined. If you get into the newer ones, like educational level and so on, there might be data interoperability issues; however, those are not commonly collected in health data.
Now, in terms of legal considerations, age is protected by HIPAA if it's reported as a date of birth, or even in years if it's above a certain age limit, that is, over 89 years old. That means it cannot be made publicly available, and researchers need IRB approval or ethics approval within their institutions before they can access the data. Sex is not usually a HIPAA-protected data type.
Now, here is how it looks in a database. This is a table from a typical tabular database, a relational database.
You might run SQL code to query the data and find certain information. You can see that each row represents one patient, and each row has multiple columns. You can see sex, date of birth, age, and other variables that might be derived from age, like the lower and upper limits of the age band or age grouping. As you can see, certain elements are HIPAA-protected. You may not actually see these in a research database, because somebody may have removed information such as the date of birth, but you can see age there in years. If it were rounded to the year, it would be called a limited HIPAA element; here, however, you can see the decimal, so it is still considered HIPAA-protected. Also, with data conversions, there might be issues with people rounding age up or down, or transforming age in other ways, that may lose some of the information you were looking for.
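To make the table and the query idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Everything in it is hypothetical: the `patient` table, its column names, the sample rows, and the sex code mapping (1 for female, 2 for male, anything else unknown) are assumptions for illustration, so check your own data dictionary. The query builds a research-style view that leaves out date of birth, truncates the decimal age to whole years, groups ages over an assumed cutoff of 89 into a single 90-or-older value, and derives the lower limit of a ten-year age band.

```python
import sqlite3

AGE_CAP = 89  # assumed cutoff: ages above this are reported as one 90+ category

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patient (
        patient_id    INTEGER PRIMARY KEY,
        sex_code      INTEGER,  -- hypothetical coding: 0 unknown, 1 female, 2 male
        date_of_birth TEXT,     -- protected element, not selected below
        age           REAL      -- decimal age, as in the slide
    )
    """
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?, ?)",
    [
        (1, 1, "1975-03-02", 45.3),
        (2, 2, "1928-11-20", 91.6),
        (3, 0, None, None),  # incomplete record: sex unknown, age missing
    ],
)

query = """
    SELECT patient_id,
           CASE sex_code WHEN 1 THEN 'female'
                         WHEN 2 THEN 'male'
                         ELSE 'unknown' END AS sex,
           CASE WHEN age IS NULL THEN NULL
                WHEN CAST(age AS INTEGER) > :cap THEN :cap + 1
                ELSE CAST(age AS INTEGER) END AS age_years,
           CASE WHEN age IS NULL THEN NULL
                WHEN CAST(age AS INTEGER) > :cap THEN :cap + 1
                ELSE CAST(age / 10 AS INTEGER) * 10 END AS age_band_lower
    FROM patient
    ORDER BY patient_id
"""
rows = conn.execute(query, {"cap": AGE_CAP}).fetchall()

for row in rows:
    print(row)
```

Note that truncating and capping age is exactly the kind of conversion that loses information: once the view is built, the decimal age and the exact ages of the oldest patients cannot be recovered from it.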