In this section we'll address how information-theoretic ideas can help us understand how the neural code may be specially adapted to the structure of natural signals. First, we'll briefly look at some of the special properties of natural inputs, and then at some theories of how the code should behave. Finally, we'll sum up with some suggestions about the principles that may be at work in shaping the neural code.

So I'm going to show you some photos that were taken by one of our post-docs, Fred Sue, as he was sitting in his apartment on one of our typical sunny Seattle afternoons, looking out at the view. He tried to take a picture that encompassed both his beautifully furnished apartment and the grand view outside. You can see that he had to change his f-stop over a wide range in order to capture information both about the scene inside and about the world outside. Now this is something that our eye does effortlessly. If you were sitting here at this table, you would be able to see both the inside and the outside with perfect fidelity. So even in this familiar example, we can see two properties that are characteristic of natural inputs. One is that there's a huge dynamic range: there are variations in light level and contrast that range over orders of magnitude. We can see signs of another property by comparing these two boxes. Because of the effects of depth and perspective, there's similar structure, similarly well-defined shapes and objects, at very different length scales. This is reflected in the power spectrum of natural images: if one computes the power in different spatial frequency components, this function has a power-law form, that is, it scales like the frequency to the power minus two. This reflects the lack of any characteristic scale; similar structure appears at every scale. Despite these scale differences and the very large variations in light and contrast across the image, we'd like to be able to distinguish detail at every point in it, unlike this camera.

These basic issues arise for almost all of our senses. Here's an audio track of a chunk of speech. The signal is full of complex fluctuations that carry detailed information about pitch and nuance. However, these fast variations are modulated by the relatively huge variations in amplitude that make up the envelope of speech. We're perfectly capable of understanding all of these signal components regardless of the overall amplitude, even when there are multiple speakers, or they're far away. So how can a neural system, with a limited range of responses, manage to convey the relevant information about details in the face of these huge variations of scale?

Recall that we found that the entropy reached its maximum when the two symbols were used equally often. Now, if we're thinking about maximizing the mutual information, we also have to take into account the noise term. But generally, the amount of noise for a given stimulus may not be something that's easily controlled, while the total response entropy is something that's in the hands of the coder. Let's see how. Let's imagine that the stimulus that a system needs to encode is varying in time, this is s of t, and it has some distribution, p of s, over here. Our job as an encoder is to map the stimulus onto the symbols that we have at our disposal. Let's imagine that we're constrained to use some maximal firing rate, so we have some limited range of possible symbols at our disposal, say zero to 20 hertz.
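As a brief aside on the power-law claim above, here is a minimal sketch of how one could check that scaling numerically. This is not from the lecture: the grayscale image array `img`, the radial binning, and the random stand-in image are illustrative assumptions; a real natural image should give a log-log slope near minus two, whereas the random stand-in gives a flat spectrum.

```python
# A minimal sketch (not from the lecture): estimating the radially averaged
# power spectrum of an image and checking for ~1/f^2 scaling.
import numpy as np

def radial_power_spectrum(img):
    """Return spatial frequencies and radially averaged power of a 2D image."""
    img = img - img.mean()                      # remove the DC component
    F = np.fft.fftshift(np.fft.fft2(img))       # 2D Fourier transform
    power = np.abs(F) ** 2
    ny, nx = img.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(x - nx // 2, y - ny // 2).astype(int)   # radial frequency bins
    radial_power = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    freqs = np.arange(1, min(nx, ny) // 2)      # skip f = 0, stay below Nyquist
    return freqs, radial_power[freqs] / counts[freqs]

# Random-noise stand-in for an image; substitute a real natural image to see
# a slope near -2 on a log-log plot.
img = np.random.rand(256, 256)
f, P = radial_power_spectrum(img)
slope = np.polyfit(np.log(f), np.log(P), 1)[0]
print("log-log slope of the power spectrum:", slope)
```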
How should we organize that mapping so that we end up with the most efficient code? We'll get the most information by maximizing our output entropy, that is, by using all of our symbols about equally often. So what does that imply for the shape of this curve? What we should do is move along our stimulus distribution and encode equal shares of that distribution with each symbol. If we have 20 symbols, let's count off one twentieth of the total area under this curve and assign that to symbol one. What this amounts to is a response curve that's given by the cumulative integral of the stimulus distribution. Another name for this is histogram equalization. So this implies that for a good coding system, its input-output function, this function here, should be determined by the distribution of natural inputs.

Here's a classic study in which this idea was tested directly. In the early 1980s, Simon Laughlin went out into the fields with a camera and measured the typical contrasts, that is, deviations in the light level divided by the mean light level, that would be experienced in the natural world, for example, by a fly. So that's this distribution here. If the response does indeed follow the distribution of natural inputs, then the response curve, here, should look like the cumulative probability determined by integrating p of c. And in fact, that's a very good match to what he actually observed in the response properties of the fly's large monopolar cells, the neurons that integrate signals from the fly's photoreceptors.

Now, a study like this poses a challenge. While it makes sense that our sensory systems would, over evolution or development, set up response codes that are adjusted to natural input statistics, it seems that much more work is needed to handle the problems posed by the huge variation that stimuli take on as one moves from indoors to outdoors, or even moves one's eyes around a room: the contrast distribution is varying widely. Might sensory systems rather adjust themselves on much shorter timescales to take these statistical variations into account? So let's take a patch of the image and look at the variations in contrast in that patch. Here, for example, the contrast distribution might be narrow like this, whereas over here, it might be much broader. What our code should do is take the widths of these distributions into account in setting up a local input-output curve that accommodates the currently measured statistics of the input.

So that's the question that we tested here, in the H1 neuron. In this experiment, we took a white-noise input of the type that you used in the problem sets, some s of t that looks like that, and we multiplied it by some slowly varying envelope, call it sigma of t, and that's what you see here. This is a 90-second-long chunk of stimulus. We repeated the same sigma of t in every trial, but we changed the specific white-noise stimulus. That allowed us to pick out spikes that occurred at different time points throughout the presentation of sigma of t, where in every trial the cell would have seen a different specific stimulus, and to calculate the input-output function described by those spikes in those different windows of time.
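To make the histogram-equalization idea concrete, here is a minimal sketch in the spirit of Laughlin's analysis, using made-up contrast values rather than his measurements; the Gaussian stand-in distribution, the bin count, and the 20 Hz ceiling (taken from the lecture's example) are illustrative assumptions.

```python
# A minimal sketch (hypothetical values, not Laughlin's data): building the
# "histogram equalization" input-output curve from a measured stimulus
# distribution. The optimal response curve is the cumulative distribution of
# the stimulus, scaled to the available response range.
import numpy as np

rng = np.random.default_rng(0)
contrasts = rng.normal(loc=0.0, scale=0.3, size=100_000)  # stand-in for measured contrasts

# Estimate p(s) with a histogram, then integrate it to get the response curve.
counts, edges = np.histogram(contrasts, bins=100)
p_s = counts / counts.sum()
cdf = np.cumsum(p_s)

r_max = 20.0                      # maximal firing rate (0-20 Hz, as in the lecture)
response_curve = r_max * cdf      # r(s) = r_max * integral up to s of p(s')

# Each response level now covers an equal share of the stimulus distribution,
# so all output symbols are used about equally often (maximum output entropy).
bin_centers = 0.5 * (edges[:-1] + edges[1:])
print(bin_centers[:5], response_curve[:5])
```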
So now, when one analyzes spikes across these different windows and pulls out their input-output functions using the methods that we talked about in week two, one finds, for example, that here in this window one gets a very broad input-output curve, whereas when the stimulus is varying very little, one finds a very sharp input-output curve. Now, it turns out that if one normalizes the stimulus by its standard deviation, or by this envelope sigma of t, all of these curves collapse onto the same curve. What that says is that the code has the freedom to stretch its input axis such that it accommodates these variations in the overall scale of the stimulus, and it's able to do that in real time as this envelope is varying.

This has been seen in several other systems, including the retina and the auditory system. But here's an example from rat barrel cortex, the somatosensory cortex of the rat, in particular the part that encodes the vibrations of the whiskers. From extracellular in vivo recordings of responses to whisker motion, whiskers were stimulated with a velocity signal, again s of t, that looked like this. This is a slightly simpler experiment: the standard deviation was varied between two different values. Now one can pull out spikes that are generated in these two epochs of the presentation, the high-variance case and the low-variance case, and one can compute input-output curves for spikes that occurred under these two different conditions. So in the low-variance case one sees this input-output curve, and in the high-variance case one sees this input-output curve. And hopefully you won't be surprised that if I now divide the stimulus by its standard deviation, we see a common curve. So now we see again that this input-output curve has the freedom to stretch itself such that it's able to encode stimuli over their natural dynamic range.

So what I've shown you is that as one changes the characteristics of the stimulus, in the cases we've talked about, by changing its overall amplitude, changes can occur in the input-output function. Here we've found that if a stimulus, say, took on this dynamic range, it might be encoded with an input-output curve like that. Now you should be able to see that if one increased the range of the stimulus and stayed with that same input-output curve, most of the time your stimuli would be giving responses that were either zero or at the saturation point. Similarly, if you now decreased the range of the stimulus, you'd be hovering around the central part of the curve. So ideally one would like to use one's entire dynamic range, defined by this input-output curve, and so one would like to match it to the range of the stimulus. And that's exactly what we saw in the experiments.

Now this adaptive representation of information is not confined to changes in the input-output function. It's also been seen that changes can happen in the feature itself as the statistics of the inputs are changed: the feature that's selected by a neural system can also adapt to changes in the stimulus statistics. And information theory has also been used to explain the way in which this occurs. For example, it's been used to explain how the spatial filtering properties of neurons in the retina and in the LGN change with light level.
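As an aside on the variance-switching experiments just described, here is a minimal sketch with simulated data (not the H1 or barrel cortex recordings): a model neuron whose spiking depends only on the normalized stimulus, so the estimated input-output curves collapse once each stimulus is divided by its own sigma. The sigmoidal model neuron, spike rate, and bin choices are all illustrative assumptions.

```python
# A minimal sketch (simulated data): input-output curves under two stimulus
# variances collapse when the stimulus is normalized by its standard deviation.
import numpy as np

rng = np.random.default_rng(1)

def io_curve(s, spikes, bins):
    """P(spike | s) estimated from spike-triggered vs. prior histograms (week-two method)."""
    p_s, _ = np.histogram(s, bins=bins, density=True)
    p_s_spike, _ = np.histogram(s[spikes], bins=bins, density=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p_s > 0, p_s_spike * spikes.mean() / p_s, np.nan)

curves = {}
for sigma in (0.5, 2.0):                        # low- and high-variance epochs
    s = rng.normal(0.0, sigma, size=200_000)
    # Adaptive model neuron: spiking depends on the *normalized* stimulus s/sigma.
    p_spike = 1.0 / (1.0 + np.exp(-(s / sigma - 1.0) * 4.0))
    spikes = rng.random(s.size) < 0.1 * p_spike
    bins = np.linspace(-3 * sigma, 3 * sigma, 41)   # bins scale with sigma
    curves[sigma] = io_curve(s, spikes, bins)

# Plotted against s/sigma, the two curves lie on top of each other:
# each bin spans the same number of standard deviations in both conditions.
print(np.nanmax(np.abs(curves[0.5] - curves[2.0])))
```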
Joe Atick and his colleagues posed the following question: if we consider that the retina imposes a linear transfer function, or filter, on its inputs, what's the shape of that filter that maximizes information transmission through the retina? The solution turns out to depend on two things: the power spectrum of natural images and the signal-to-noise ratio. At high light levels, or high signal-to-noise, one would predict a filter shape like the one we've seen already, the Mexican-hat shape. This acts like a differentiator, looking for edges in the stimulus. But at low light levels, the predicted optimal filter is integrating, and simply averages its inputs to reduce noise. And indeed, in retinal receptive fields it's seen that the surround becomes weaker at low light levels and the center becomes broader, which qualitatively matches these predictions.

We can also use information theory to find out what it is about a stimulus that drives a neuron to fire. We looked at this method in week two; it's called the method of maximally informative dimensions. One chooses a filter, extracting from the stimulus some component, so as to maximize the Kullback-Leibler divergence between the spike-conditional and the prior distributions. This turns out to be equivalent to maximizing the information that the spike provides about the stimulus. One can use this method to search for the optimal feature that explains the coding properties of a system when it's being presented with stimuli of a particular distribution. So, for example, if one initially starts with a Gaussian white-noise distribution, which is a Gaussian in this representation, one might find a particular feature. But now if one changes the distribution to, say, natural images, which will have some very different distribution, the filter that maximizes the information between spike and stimulus may be different, and that has been shown to be the case for cortical receptive fields, among other systems.

So, let's finish up by briefly discussing an influential idea that Rajesh mentioned in the first lecture, one that might explain why cortical receptive fields have the shape that they do. Many years ago, Horace Barlow proposed that because spikes are expensive, neural systems should be trying to encode stimuli as efficiently as possible. What does this mean for a population of neurons? If you consider the joint distribution of the responses of many neurons, here let's just take two, maximizing their entropy should imply that they code independently; that is, their joint distribution should factor into the product of the two marginal distributions. This is the strategy that would maximize their entropy. Why is that? Because the entropy of a joint distribution is always less than or equal to the sum of the entropies of the marginal distributions. This idea is known as redundancy reduction: the neural system should be optimized so that its neurons code as independently as possible. However, in recent years it's been realized that correlations between neurons can have some advantages. For one, having many neurons that encode the same thing may allow for error correction and more robust coding. It's also been realized that correlations can actually help discrimination, and indeed, neurons in the retina have been observed to be redundant; that is, their joint distribution is very different from the product of independent distributions.
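To illustrate the entropy argument behind redundancy reduction, here is a minimal sketch with a toy joint distribution over two binary neurons (the particular numbers are made up): the joint entropy of the correlated pair falls short of the sum of the marginal entropies, and only the factorized distribution reaches that bound.

```python
# A minimal sketch (toy joint distribution, not data): the joint entropy of two
# neurons is at most the sum of their marginal entropies, with equality only
# when their responses are independent.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution over two binary responses (rows: neuron 1, cols: neuron 2).
p_correlated = np.array([[0.4, 0.1],
                         [0.1, 0.4]])
p1 = p_correlated.sum(axis=1)                 # marginal of neuron 1
p2 = p_correlated.sum(axis=0)                 # marginal of neuron 2
p_independent = np.outer(p1, p2)              # same marginals, but factorized

print("H(joint, correlated)  =", entropy(p_correlated.ravel()))   # < H1 + H2
print("H(joint, independent) =", entropy(p_independent.ravel()))  # = H1 + H2
print("H1 + H2               =", entropy(p1) + entropy(p2))
```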
More recently, Barlow proposed a new idea: that neural populations should be as sparse as possible, that is, that their coding properties should be organized so that as few neurons as possible are firing at any time. This idea was developed formally by Olshausen and Field, and also by Bell and Sejnowski. Here's the idea. Let's say that one can write down a set of basis functions, phi i, with which to reconstruct a natural scene. Then any image can be expressed as a weighted sum, with coefficients a i, over these basis functions, perhaps with the addition of some noise. Now these basis functions should be chosen so that, in general, as few coefficients a i as possible are needed to represent an image. This is carried out by minimizing a function that includes the reconstruction error, here the mean squared difference between the reconstructed image and the image itself, so that one gets a good match to the images, but that also includes a cost term whose role is to count how many coefficients are needed. One simple choice of this cost function is just the absolute value of the coefficients. The coefficient lambda weights the strength of that constraint. The job of this term is to penalize solutions that require too many basis functions, too many coefficients a i different from zero, to represent an image. A Fourier basis, for instance, represents images as a sum of sines and cosines. While the Fourier basis is guaranteed to be able to represent any image, one might already be able to guess that coding with such a basis is not sparse, because, as you probably recall, the power spectrum is broad, which means that many coefficients are needed. When one runs an algorithm to find the best basis functions, the best values of phi i, for natural images, one finds instead a set of functions that look like this: localized, oriented features, like those that we see in V1. So this implies that when we view an image using neuronal receptive fields that look like this, it excites, on average, a minimal number of neurons. This is called a sparse code.

So we've touched upon several different ideas about coding principles. The idea of coding efficiency: that neural codes should represent input stimuli as efficiently as possible. We've seen that this implies adaptation to stimulus statistics; as one changes the statistics of the stimulus, one should see aspects of the coding model changing to ensure that it remains efficient. We've also brought up the idea of sparseness: that it would be ideal if the neural code needed as few neurons as possible to represent its input.

And this brings us to the end of our discussion of coding. I've shown you some classic and state-of-the-art methods for predicting how stimuli are encoded in spikes. We've seen models for decoding stimuli from neural responses. We've discussed information theory and how it's used to evaluate coding schemes, and we've taken a very quick glance at how coding strategies might be shaped by the statistics of natural inputs. There's a lot that we've missed. In particular, let's just go through the typical cycle of behavior of an organism. Where we have invested some time is in the idea that, faced with complex environments, animals extract features from the environment to solve problems, and that those features are represented in neural activity. What the brain is then doing is extracting that information and synthesizing it to drive decisions.
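Before moving on, here is a minimal sketch of the sparse-coding objective described above, on toy dimensions rather than real image patches; the basis size, the lambda value, and the iterative soft-thresholding (ISTA) solver are illustrative choices, not Olshausen and Field's original algorithm (the sketch uses the standard half-squared-error form of the cost).

```python
# A minimal sketch (toy dimensions): the sparse-coding cost
# 0.5 * ||I - sum_i a_i phi_i||^2 + lambda * sum_i |a_i|,
# with coefficients inferred by iterative soft thresholding (ISTA).
import numpy as np

rng = np.random.default_rng(2)
n_pixels, n_basis = 64, 128                 # overcomplete basis
Phi = rng.normal(size=(n_pixels, n_basis))
Phi /= np.linalg.norm(Phi, axis=0)          # unit-norm basis functions phi_i
image = rng.normal(size=n_pixels)           # stand-in for an image patch
lam = 0.2                                   # sparseness weight (lambda)

def sparse_coefficients(image, Phi, lam, n_iter=500):
    a = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2        # safe gradient step size
    for _ in range(n_iter):
        residual = image - Phi @ a                   # reconstruction error
        a = a + step * (Phi.T @ residual)            # gradient step on squared error
        a = np.sign(a) * np.maximum(np.abs(a) - lam * step, 0.0)  # soft threshold
    return a

a = sparse_coefficients(image, Phi, lam)
cost = 0.5 * np.sum((image - Phi @ a) ** 2) + lam * np.sum(np.abs(a))
print("active coefficients:", np.count_nonzero(a), "of", n_basis, "| cost:", round(cost, 3))
```

Raising lambda in this sketch drives more coefficients to exactly zero, which is the sense in which the penalty "counts" how many basis functions are used.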
We talked about some examples of using maximum likelihood methods that might in fact have a neural implementation. These decisions then generate motor activity, which drives behavior: muscles work together to perform actions that produce behavioral output, and these actions affect subsequent sensation. So we didn't really address that part of the behavioral feedback loop. Next week, we'll be moving on to a new topic. Rather than data analysis, we'll be moving more into the realm of modeling, and we'll start with a brief introduction to the biophysics of coding: how do single neurons generate action potentials? We'll talk about neuronal excitability, and we'll close with some simplified models that capture neuronal firing, before moving on to the second part of the course, where you'll be learning about network modeling. So that's all for this week. Looking forward to seeing you next week.