In this session, I will introduce the GEOquery package. This is a package for interfacing with NCBI GEO, the Gene Expression Omnibus. GEO is a widely used repository of public data and, unlike what the name suggests, it contains other data types than gene expression data. There's also a lot of epigenetic data in there.
GEO is a repository where different data sets are stored, each with an accession number. Generally in GEO, you have the data set from a paper, and the data from a paper can be of different kinds. For example, there can be a sub-data set for RNA sequencing and a sub-data set for ChIP sequencing. That gives rise to two different data sets inside the same series, and each data set can have a number of samples associated with it. There is an accession number for the super-series, which covers the entire data in the paper, and there's an accession number for each sub-series. In this case I have described a sub-series for the RNA-seq data and a sub-series for the ChIP-seq data. And then there are accession numbers for the individual samples.
Most people interested in downloading data from GEO want all the data associated with a given publication. The starting point of all this is the GEO identifier. Once you have that, it's pretty straightforward. You load the package, and you get the data. It takes a little while, and it downloads the data into a list. The reason it downloads the data into a list is, as I said before, that there might be one data set associated with RNA sequencing and another data set associated with ChIP sequencing. In that case, you would get two components in the list. In this case here, eList has a single element, and we can see that it's a series matrix with this particular accession number. We get the data by just taking the first element, and we can look at it. So this really is an ExpressionSet, and everything looks great. There are 12,000 features and 6 samples.
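The steps described so far might look roughly like the sketch below. The accession `GSE11675` is only an illustrative choice, since the session does not name the identifier it uses. The variable names `accession` and `eData` are also assumptions, and running this requires network access and the Bioconductor packages GEOquery and Biobase.

```r
# Minimal sketch: fetch the processed data for one GEO series.
# "GSE11675" is an illustrative accession, not one named in the session.
library(GEOquery)
library(Biobase)

accession <- "GSE11675"

# getGEO() goes online and returns a list, with one element per
# series matrix (for example, one per sub-series).
eList <- getGEO(accession)
class(eList)
length(eList)
names(eList)

# Take the first element of the list; this is an ExpressionSet.
eData <- eList[[1]]
eData        # prints a summary of the object
dim(eData)   # number of features and number of samples
```

If a series holds several sub-series, `length(eList)` would be larger than one, and each element can be pulled out in the same way.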
We can even look at the phenotype data associated with the different samples. Usually when you look at that in GEO you get a lot of weird variables that are not very useful, so I'm just going to show you the names of the data. These contain information such as the contact phone number of the person who uploaded the data, but also useful information about which samples were actually being run.
It's important to understand that NCBI GEO operates with both something they call raw data and something they call processed data. Processed data is fully normalized data, ready for analysis, and in many cases you may not even be interested in getting the processed data. Personally, I prefer to get the raw data associated with most publications and do my own normalization or my own processing of the data. What we get here the easy way is the processed data in the form of a matrix. Also, not all data ends up in the form of a matrix when you're done processing it, and for such data you cannot easily get the fully processed data. GEO kind of assumes the data ends up in a matrix.
The way you get the raw data is you get the supplementary files, and the supplementary files can be anything. It depends on what the uploader has defined to be the raw data. There are certain conventions in the field. For example, for Affymetrix microarray gene expression data, we tend to think of the raw data as something called CEL files, a binary format detailing what was imaged on the array. But for other data types it may be a little more unclear what is raw data and what is semi-processed data. For next-generation sequencing, you can think of the FASTQ files as the raw data, and you can think of the BAM files, which detail the aligned reads, as a form of processed data.
Raw data in GEO terms is called supplementary data. So we have a function called getGEOSuppFiles, and we can use the same accession number, and it goes online.
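Going back to the phenotype data mentioned at the start of this part, a small sketch of how it might be inspected follows. As before, `GSE11675` is only an illustrative accession that the session does not name, and `pheno` and `contactCols` are hypothetical variable names.

```r
# Minimal sketch: inspect the phenotype (sample-level) data.
# "GSE11675" is an illustrative accession, not one named in the session.
library(GEOquery)
library(Biobase)

eData <- getGEO("GSE11675")[[1]]

# pData() returns a data.frame with one row per sample.
pheno <- pData(eData)
dim(pheno)

# Only list the column names; many of them are administrative.
names(pheno)

# Columns whose names start with "contact" hold uploader details.
contactCols <- grep("^contact", names(pheno), value = TRUE)
contactCols
```

Listing only the names keeps the output short; the columns of real interest can then be selected by name.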
At first it does not find anything. That's because I used the wrong number. With the right accession number, it downloads a tar archive. A tar archive is a little bit like a ZIP archive, but it's more used in the Unix world. Inside that archive, there's the data. I tend to go outside of R and untar it to see what's in there.
So now we have a little directory where the data has been downloaded into. We can look at that by looking at eList2. In the row names of this object, we get the file name. And now it's ready for further processing, where you tend to have to use different functions for reading in the raw data, depending on what the raw data really is.
So this shows how easy it is to get some data from NCBI GEO. There is a similar package for accessing the Short Read Archive, and there's another package for accessing ArrayExpress from EBI.
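As a rough sketch of the supplementary-file route described above, the code below downloads the files and lists the contents of any tar archive from inside R, as an alternative to untarring at the command line. `GSE11675` is again only an illustrative accession that the session does not name, and `tarFiles` is a hypothetical variable name. It also assumes that the uploader supplied at least one `.tar` file, which is not guaranteed.

```r
# Minimal sketch: download the supplementary ("raw") files for a series.
# "GSE11675" is an illustrative accession, not one named in the session.
library(GEOquery)

accession <- "GSE11675"

# By default the files go into a directory named after the accession,
# created under the current working directory. The result is NULL when
# GEO has no supplementary files for the accession.
eList2 <- getGEOSuppFiles(accession)
eList2

# The row names of the returned data.frame are the downloaded file paths.
rownames(eList2)

# Assumption: at least one downloaded file is a tar archive.
tarFiles <- grep("\\.tar$", rownames(eList2), value = TRUE)

if (length(tarFiles) > 0) {
  # List the archive contents without extracting anything.
  print(untar(tarFiles[1], list = TRUE))

  # Extract next to the archive, then look at the resulting files.
  untar(tarFiles[1], exdir = dirname(tarFiles[1]))
  print(list.files(dirname(tarFiles[1])))
}
```

What to do with the extracted files depends on their type, so the reading step is left out here.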