All right. Welcome to the fourth module of Plant Bioinformatics. This week we're going to be doing some promoter analysis. So, why do we want to do promoter analysis? Well, a gene's promoter composition, that is, the cis-elements that are found in it, will in part determine how and when that gene is expressed, by allowing the transcription factors that recognize those cis-elements to bind and thus bring about transcription of the gene. This in turn can affect development of the plant or help the plant respond to environmental stresses. So, there are lots of different transcription factors in plants, and many different transcription factor families. For instance, there's the C2H2 transcription factor family, with 116 members in Arabidopsis, and these require histidine and cysteine to be able to chelate a zinc to fold the transcription factor up such that it can actually read the genetic code. There are also bZIP transcription factors, and these form scissor-like structures that can read the code in the DNA in this manner; an example is ABF2. So, there's been a lot of effort in humans to understand the regulatory code, and this is called the ENCODE project, whose goal is to build a comprehensive parts list of all the functional elements in the human genome, including elements that act at the protein and RNA levels and regulatory elements that control the cells and circumstances in which a gene is active. Very ambitious. There are a bunch of different methods being used to carry out the ENCODE project: RNA-seq to generate information about the level of transcripts, things like ATAC-seq to get at chromatin accessibility, ChIP-seq to get at transcription factor binding specificity and where transcription factors are binding, and things like ChIA-PET or Hi-C to get at larger conformational arrangements between different parts of the chromosome.
Excitingly, there's a MaizeCODE project, which is coming soon, and this figure will look familiar from the previous slide. Again we can see that for the MaizeCODE project, similar methodologies are being used to get at aspects of the regulatory code in maize: RNA-seq, ChIP-seq to determine transcription factor binding specificity and location of binding, and 5C and ChIA-PET to get at these conformational arrangements of different parts of the chromosome. So, in order to determine plant transcription factor binding specificity, we can carry out methods like ChIP-seq. In this case what we're doing is attaching a tag to a particular transcription factor, and then we can use that tag to actually pull down either the transcription factor or a histone, and then purify the DNA bound to that transcription factor. We can do a cross-linking step, or we can pull down the DNA associated with a specific histone modification, and then we can sequence the fragments that are associated with either the transcription factor or the histone. In the case of ChIP-seq with a transcription factor, we can actually use the fragments to determine what the transcription factor binding specificity is. So, we can do motif discovery on the fragments and find out what the binding specificity is, and we can also look at where these motifs are found relative to the start of transcription, and that kind of thing. We can also use something called protein binding microarrays to determine transcription factor binding specificity, and this was done by Tim Hughes and colleagues for Arabidopsis sequences and for some other plant species. In this case what we do is we have all possible 10-mers convoluted onto a microarray.
That is, all possible combinations of 10-mers. Then we can hybridize a GST-tagged transcription factor to that array and detect which oligos that transcription factor is binding to. Because of the arrangement of the oligos on the array, we know where the binding is occurring, and we can then use that information to determine transcription factor binding specificity. Another method that's recently been developed is called DAP-seq, so DNA affinity purification sequencing, and this was developed in Joe Ecker's lab. What we do here is express the transcription factor in E. coli or another system, and that transcription factor is labeled with an affinity tag. We can then expose that transcription factor to genomic DNA, or to native genomic DNA with its methylation still intact, and then, after sequencing the DNA fragments to which the transcription factor is bound, we can figure out what the binding motif of that particular transcription factor is, and also whether or not it's sensitive to methylation. So, in the case of the PBMs, they were able to generate 313 motifs encompassing 745 transcription factors, and DAP-seq was done on 343 transcription factors, and quite incredibly 75 percent of these are 5-methylcytosine sensitive. So, that's a pretty nice and astonishing fact from the DAP-seq paper. We can also determine whether or not a transcription factor is able to bind to a promoter using the yeast one-hybrid system. What we do in this case is we have an activation domain attached to our transcription factor of interest, and we fuse the promoter for our gene of interest upstream of a reporter gene. If that transcription factor is able to bind to the promoter, we then get transcription of the reporter gene, and we can have, say, a blue readout in our yeast cells, which indicates that the transcription factor is actually binding to the promoter.
In fact, you can get this done as a service at the Brady Lab at UC Davis for plants; it's possible with about 2,000 Arabidopsis transcription factors. All right. So, how can we describe transcription factor binding specificities? Well, ChIP-seq and DAP-seq measure physical interactions and can tell you where in the genome that interaction is, and the binding specificities of the transcription factors that were used can be deduced from the bound sequence information. The protein binding microarray method described on slide 7 can let you know the binding specificity of a given transcription factor, but not where or when it actually binds in the genome, because it's an in vitro system. Yeast one-hybrid tells you that a certain transcription factor can actually bind to a given promoter and bring about transcription, but it doesn't tell you which cis-element it binds or whereabouts in that promoter. ChIP-seq, DAP-seq and the protein binding microarray method can all be used to generate position-specific scoring matrices of transcription factor binding specificities. So what does that mean? Let's say we have a set of sequences that are bound by a given transcription factor. We can identify the commonalities in those sequences using something like Gibbs sampling, which we talked about in the last lab of Bioinformatic Methods II, create an alignment of those sequences, and then convert that alignment to a position-specific scoring matrix. What we do in this case is, for each position in the alignment, we just count the numbers of As, Cs, Gs and Ts at that particular position and enter those numbers in this matrix.
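As a minimal sketch of that counting step, here is some Python that turns an alignment into a count matrix. The five aligned sites and the function name `count_matrix` are made up for illustration; they are not the sequences from the slide, just a set chosen to behave the same way (four fixed positions followed by two variable ones).

```python
# Hypothetical aligned binding sites, for illustration only
# (not the actual sequences shown in the lecture).
sites = ["ACGTGA", "ACGTGC", "ACGTGT", "ACGTCG", "ACGTCA"]


def count_matrix(aligned_sites):
    """Return {nucleotide: [count at each alignment position]}."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    # One row per nucleotide, one column per alignment position.
    matrix = {nt: [0] * length for nt in "ACGT"}
    for site in aligned_sites:
        for pos, nt in enumerate(site.upper()):
            matrix[nt][pos] += 1
    return matrix


counts = count_matrix(sites)
for nt in "ACGT":
    print(nt, counts[nt])
```

Each printed row is one nucleotide and each column is one position in the alignment, so a column with a single count of five is a fully conserved position, and a column with counts spread over several rows is a variable one.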
So, at position one we have five As, at position two we have five Cs, at position three we have five Gs and so on, but we see at position six that we actually have a mixture of nucleotides that seem to be possible at that particular position. We can also represent this information as a sequence logo, and we actually talked about sequence logos in Bioinformatic Methods II Lab 1. But basically, in the case of nucleotide information, the maximum information that we can capture, that is, the degree of conservation, is two bits. We can see quite nicely that in the case of the first position we have absolute conservation of the A at that position, or a requirement for A at that position, and the same thing for C, G and T. Then at positions five and six there's variability allowed. The information content is lower because there's less of a requirement for one specific nucleotide, and we can see the preference, if there is any, from the height of the individual letters within the stack, so G is slightly preferred over C in the case of the fifth position. We can also convert this PSSM or MSA into something called a regular expression. Here, what we do is, for each position, if it's conserved or absolutely required, we just denote it with the given nucleotide, and we denote the wobbles within square brackets. We can use either the PSSM or the regular expression to search against promoter sequences to see if that particular pattern is present in those promoter sequences. All right, we can also computationally identify potential transcription factor targets in promoters, and there's a nice tool that's been published called FIMO. What FIMO takes as input is a PSSM, so here's a representation of that PSSM, and then if we scan against all of the promoters, we get this sort of output, which tells you the sequence that was matched, the p-value, the q-value, and the start and end of that sequence on the chromosome in this case.
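To make the sequence logo and regular expression ideas concrete, here is a minimal Python sketch. It assumes a made-up count matrix and a made-up promoter string, and the function names are hypothetical. The information content uses the simple formula of two bits minus the Shannon entropy of each column, with no small-sample correction. The search is a plain regular-expression match on one strand; it is not how FIMO scores sites, since FIMO works from the PSSM and reports p-values and q-values.

```python
import math
import re

# Hypothetical count matrix (rows A, C, G, T; six positions), illustration only.
counts = {
    "A": [5, 0, 0, 0, 0, 2],
    "C": [0, 5, 0, 0, 2, 1],
    "G": [0, 0, 5, 0, 3, 1],
    "T": [0, 0, 0, 5, 0, 1],
}


def information_content(counts):
    """Bits per position: 2 minus the Shannon entropy of the column."""
    bits = []
    for pos in range(len(counts["A"])):
        column = [counts[nt][pos] for nt in "ACGT"]
        total = sum(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in column if c)
        bits.append(2.0 - entropy)
    return bits


def to_regex(counts):
    """Conserved positions become one letter; wobbles go in square brackets."""
    parts = []
    for pos in range(len(counts["A"])):
        allowed = [nt for nt in "ACGT" if counts[nt][pos] > 0]
        parts.append(allowed[0] if len(allowed) == 1 else "[" + "".join(allowed) + "]")
    return "".join(parts)


def scan(pattern, promoter):
    """Return (0-based start, matched sequence) pairs, overlaps included."""
    lookahead = "(?=(" + pattern + "))"
    return [(m.start(), m.group(1)) for m in re.finditer(lookahead, promoter)]


bits = information_content(counts)
pattern = to_regex(counts)
promoter = "TTACGTGATTTACGTCCAA"  # made-up promoter fragment
hits = scan(pattern, promoter)

print([round(b, 2) for b in bits])
print(pattern)
print(hits)
```

With these made-up counts the four conserved positions come out at the full two bits and the two variable positions come out much lower, and the regular expression is `ACGT[CG][ACGT]`. Note that a regular expression treats every allowed nucleotide at a wobble position as equally good, whereas the PSSM keeps the preference information, which is one reason to scan with the matrix when you can.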
However, is every identified site a cis-regulatory element? Well, that's a good question, and here's an interesting study from Wen-Hsiung Li's lab at the University of Chicago and Academia Sinica. What his lab did is they collected 586 position weight matrices for 400 Arabidopsis transcription factors from various databases, and then used the FIMO tool to find matches in Arabidopsis promoters. What they did that was insightful is to apply a conservation criterion of that element also being present in the promoters of related species, so Brassicaceae species such as Arabidopsis lyrata, Brassica oleracea, or Brassica rapa. You can actually explore this conservation with a tool that my own lab has developed called GeneSlider, and we'll see that in the lab. What they found was that if this conservation criterion is met, then there is a positional disequilibrium in the distribution of these elements along the promoter, with a peak at around minus 50 base pairs, and it's known that the region around minus 50 base pairs is quite important for regulating gene expression. If we just look at the distribution of these mapped elements along the promoter without this conservation criterion, we see quite a flat distribution. So this is actually a nice way to perhaps narrow down the set of mapped sites to those that might be functionally relevant. But ultimately, if we want to determine the functionality of identified promoter cis-elements, we probably should do reporter gene assays. In this case what we would do is fuse the regulatory sequence to be studied (so the gene's promoter) upstream of a reporter gene such as GFP or luciferase, and then transform that into a plant or do transient expression. mRNA would get produced if this is an active promoter, and we could look for fluorescent protein production.
Then the complementary experiment that you would do is to knock out a particular cis-element that you think is acting, and then ask whether or not you actually get production of the reporter gene under the conditions in which you think it should be induced. All right, so like the wet-lab methods, online resources have different pros and cons. For instance, ePlant, which we'll be exploring, has yeast one-hybrid and DAP-seq data encompassing 2.8 million protein-DNA interactions in Arabidopsis. However, the binding location within the promoter is not available. In the case of Cistome, we have protein binding microarray data and JASPAR data for 745 and 48 transcription factors, respectively, mapped to promoters with FIMO, but the DAP-seq data are not available. You can get a nice positional distribution graph in the case of Cistome. In the case of Athena, that relies on older PLACE data, but those data are based on smaller-scale studies, and there are some nice visualization tools, so we'll be exploring that in the lab. PlantTFDB has 674 manually curated, filtered PWMs, and a tool within it, ATRM, can identify transcription factor-promoter interactions. PlantPAN has more than 1,000 PWMs across 96 species, and it has this really cool cross-species promoter viewing tool; I encourage you to check it out. Last, TF2Network has 1,793 PWMs for 916 TFs, and we're actually going to cover this in Module 6. Other TF databases don't really have information on binding or are outdated. Definitely see Lab 6 of Bioinformatic Methods II for some useful JASPAR and Cis-BP tools, and finally keep an eye on the MaizeCODE and Ecker lab projects. They are at the bleeding edge of this area of research. Hope you enjoy the lab, thanks.