In this session, we will showcase the ExpressionSet data class. This is a very important data container in Bioconductor. It's important in its own right, and it's also important as a foundation for a lot of other data containers. As an example data set, we'll look at the ALL package, which is a so-called experiment data package. These types of packages repackage existing data from publications into nice, easy-to-use data set [INAUDIBLE]. So there's a data set called ALL, and we can try to print it. From the print method we see that this is an ExpressionSet, big surprise there, and there seem to be around 12,000 features (in this case a feature is a gene) and 128 samples. We can learn a little bit more about the experiment by using the accessor function called experimentData, which prints a little bit of information about who did the experiment and what the title was. And there are links to two PubMed IDs for specific papers detailing the experiment. More information can be had by looking at the help page, where we see a little bit of information about what covariates are in the data set and where the source is from. Obviously, for most data sets that you use in practice, you don't get them from inside of a [INAUDIBLE], and the help page is not available.
Let's finally explore this data set a little bit. One of the most important aspects of the data set, depending on how you look at it, is the expression, or the quantification of the expression, of the different genes. That can be accessed using the exprs accessor function, which returns a matrix. It's a big matrix, with around 12,000 rows and 128 columns, so I'm just going to print the first four rows and the first four columns. And here we see, on the rows, some identifiers. These are names invented by the microarray vendor, detailing what is actually being measured.
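The steps so far can be sketched in R roughly as follows. This is a minimal sketch, assuming the Bioconductor packages Biobase and ALL are already installed; the installation hint in the comment is an addition, not something shown in the session.

```r
# Assumption: Biobase and ALL are installed, for example with
# BiocManager::install(c("Biobase", "ALL")).
library(Biobase)
library(ALL)

data(ALL)             # loads an ExpressionSet object named ALL

ALL                   # print method: class, number of features and samples
experimentData(ALL)   # who ran the experiment, title, PubMed IDs
# ?ALL                # help page (only exists because ALL comes from a package)

expr_matrix <- exprs(ALL)   # the expression matrix: features x samples
dim(expr_matrix)            # number of rows (features) and columns (samples)
expr_matrix[1:4, 1:4]       # first four rows and first four columns
```

The row names of the printed corner are the vendor's probe identifiers, and the column names are the sample IDs.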
And on the columns, we have these numbers that are sample IDs. You can get the sample names and the feature names by using the sampleNames accessor function and the featureNames accessor function. And we print them. So it's easy to get these names, which we use quite a lot. These are the expression measures. We are also very interested in the phenotype data. The phenotype data is the covariates, or information about the samples that were run. We get that by using the [INAUDIBLE] pData, 'p' for pheno, and this is a big data frame. It has 128 rows, so we are just going to print the top of it. We can see there's information about the samples, such as the sex, the diagnosis, the age, and various other types of information that are important in order to interpret the data. Usually you're not interested in all the covariates; you want to do some modelling, so you want to access a specific covariate. You can do that, since this is a data frame, by using the dollar operator. But you can also just use the dollar operator directly on the ExpressionSet, and that gives you the covariate, as you can see here. That's a very useful and quick shortcut.
An ExpressionSet supports two-dimensional subsetting. The first dimension gives you features, and the second dimension gives you samples. So with this type of subsetting, we are selecting the first five samples. And you can see in the output that I get back all 12,000 genes, [COUGH] but only five samples. In the same way, I can ask for the first 10 features. And now I get back an ExpressionSet that has 10 features, but keeps all 128 samples. And I can of course subset on both things simultaneously, and then I get a small data set back. This means that the ExpressionSet is closed under subsetting: when we subset an ExpressionSet we get another ExpressionSet, and this is a very useful feature. So you can think of the pData data frame as holding information about the samples.
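The accessors and the subsetting described above might look like this in R. Again a hedged sketch, assuming Biobase and ALL are installed; the variable names (first5samples, first10features, small) are made up for illustration.

```r
library(Biobase)
library(ALL)
data(ALL)

head(sampleNames(ALL))     # sample IDs (column names of the expression matrix)
head(featureNames(ALL))    # vendor probe IDs (row names of the expression matrix)

head(pData(ALL))           # phenotype data: one row per sample
head(pData(ALL)$sex)       # one covariate, via the data frame
head(ALL$sex)              # the same covariate, directly on the ExpressionSet

first5samples   <- ALL[, 1:5]      # all features, first five samples
first10features <- ALL[1:10, ]     # first ten features, all samples
small           <- ALL[1:10, 1:5]  # both at once

dim(small)                 # features and samples of the subset
class(small)               # still an ExpressionSet: closed under subsetting
```

Because subsetting keeps the expression matrix and the phenotype data in sync, pData(small) only contains the rows for the five selected samples.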
In the same way that it has information about the samples, you can also get information about the features. You access that with something called featureData; not fData, as one might think, but featureData. Unfortunately this slot here is, in most instances that I know of, empty. It can contain information about the genes, but in this case, and in many other cases, people don't put the information into the object itself. We'll talk about that in a moment.
Now, let's look a little bit at annotating and understanding what was measured on the array. So let's get the first five feature names. These are the identifiers you've seen before. They don't really make sense in themselves; these are identifiers that the microarray vendor decided to put on the different genes. In order to fully understand what genes were measured on the array, we have to take these identifiers and map them into gene symbols in a way that makes sense. In Bioconductor this is done through an annotation package. In this case the microarray was of a type called hgu95av2, which was a very widely used microarray, so this information is inside the hgu95av2.db package. We're not really going to go into details about how to use this package, but we'll just point out that it's possible, using various maps inside this package, to map these Affymetrix IDs into specific things. So, for example, there's an object that allows you to map into Entrez IDs, hgu95av2ENTREZID. So we're going to take these identifiers and subset the map with them, and you can see that we now get back Entrez IDs. These are meaningful gene identifiers; we can query them in a public database.
Now I'm going to add a little note of historical interest about pData. In reality, there's both something called phenoData and something called pData. pData is really what I recommend people use; it gives you back a data frame.
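For the feature-level part of the discussion above (the featureData slot and the probe-to-gene mapping), a minimal R sketch could look like the following. It assumes the annotation package hgu95av2.db is installed in addition to Biobase and ALL; the variable names ids and entrez are made up for illustration.

```r
library(Biobase)
library(ALL)
library(hgu95av2.db)   # annotation package for the hgu95av2 array
data(ALL)

featureData(ALL)       # the feature-level slot; typically empty in practice
annotation(ALL)        # which array (and so which annotation package) applies

ids <- featureNames(ALL)[1:5]   # first five vendor probe identifiers
ids

# Subset the probe-to-Entrez map by these identifiers and turn it into a list.
entrez <- as.list(hgu95av2ENTREZID[ids])
entrez
```

The result is a list keyed by probe identifier, with the corresponding Entrez gene identifier as each value.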
phenoData gives you this thing called an AnnotatedDataFrame, with sampleNames and varLabels. The idea, when they created this back in the day, was that a standard data frame had no information about the different covariates. For example, in this data set there's a covariate called diagnosis, but what type of diagnosis? Or, let's take that back; bad example here. We're measuring age. Some of these are somewhat straightforward, but quite often it's hard to fully understand what is inside a column. So in order to address that, they made this data class called AnnotatedDataFrame, which really was a mixture of a data frame and something called varLabels. That contains, in this case, not a lot of information, but it could contain information about what was measured in the different columns. So this is a little bit of historical interest. This is in order to illustrate that phenoData of ALL is not the same as pData of ALL, and that phenoData actually contains the pData slot. We can call pData on phenoData of ALL, and then we get back the full pData data frame.
This is what I have to say about ExpressionSets. I hope the last part was not too confusing. This is a very, very widely used type of data container that has proven to be immensely successful and important to using Bioconductor. The key thing is that it keeps the expression information and the phenotype information together in a sensible way.
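The difference between phenoData and pData described above can be sketched in R as follows, assuming Biobase and ALL are installed. The variable name pd is made up for illustration, and the call to varMetadata is an addition that was not shown in the session.

```r
library(Biobase)
library(ALL)
data(ALL)

pd <- phenoData(ALL)   # an AnnotatedDataFrame, not a plain data frame
class(pd)
class(pData(ALL))      # a plain data frame

varLabels(ALL)         # names of the covariates
varMetadata(pd)        # per-covariate descriptions (addition, see above)

# pData on the AnnotatedDataFrame gives back the same data frame
# as pData on the ExpressionSet itself.
identical(pData(pd), pData(ALL))
```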