Welcome to Week 11 of our medical software class. In this week's lectures, we'll focus on artificial intelligence and machine learning (AI/ML): the use of these technologies, the challenges they produce as we integrate them into software, and the whole question of how one regulates these things, how you create rules to ensure that the use of such techniques results in safe software that doesn't endanger its users and their caregivers. These are the goals of this week's lectures. We'll introduce what machine learning is, and we'll discuss deep learning. Then we'll talk a little bit about the regulatory aspects of the use of AI/ML in medical software. Finally, we'll touch a little on AI/ML as part of the software life cycle. I made the decision as we designed this class not to sprinkle AI/ML throughout the earlier weeks, but to save all of those issues for this segment here. We have six segments; the first two are the basics. We'll introduce AI/ML techniques, then we'll talk a little bit about the evaluation of machine learning methods and related issues. Then we'll introduce deep learning; Prof. Nicha Dvornek will come and give a presentation on that. Then I'll come back and talk about the regulatory aspects: first an overview of the regulatory guidance from other countries, and then a deep dive into the Singapore guidance for AI/ML, which is one of the better documents out there. Finally, Professor John Onofrey will come in and talk about AI/ML and the software life cycle. This week's material is spread across the textbook: there is some in Chapter 8, Chapter 9, and Chapter 15, but you can track it from the index if you want to do more reading on this topic. Let's get started with an introduction to artificial intelligence and machine learning. What do these terms mean? What are they? These are terms that are somewhat confused, and you'll hear them used interchangeably.
People who have a long background in computer science have very specific meanings for them; others use these terms more loosely. We'll try to give some definitions, but please realize that the use of these terms is far from consistent, and you'll hear people use them interchangeably or use slightly different definitions as we go. Here's one definition from John McCarthy of Stanford, one of the fathers of this field: AI is defined as the science and engineering of making intelligent machines, especially intelligent computer programs. Think of AI as the superset; you can see the big green oval on the right of the screen. AI can use different techniques, such as ML, machine learning, to produce intelligent behavior, including models based on statistical analysis of data and expert systems that primarily rely on if-then statements. ML is a subset of AI. In this paper, and this is the usage we will follow here too, an ML system is a system that has the capacity to learn, based on training on a specific task, by tracking performance measures. Then the final statement here: AI, and specifically ML, are techniques used to design and train software algorithms to learn from and act on data. Finally, deep learning, which is the other term you'll hear, is one particular machine learning technique, although it's by far the most popular right now. When people think about machine learning and AI, for all intents and purposes, right now they're thinking about deep learning; deep learning has taken over this field. The definition I would like you to keep in mind when you think about ML is this: it is a technique that lets you learn a relationship between data, so that given the input, it can predict the output. It's a learning mechanism that learns patterns. If I give you a set of images and a set of diagnoses for, say, cancer patients, and I learn the relationship, I can predict a diagnosis from the images.
When I give you a new image, you can get a prediction of a new diagnosis using this learned relationship. That is the essence of what machine learning is right now. Then there is the informal definition that you'll hear out there. This is from a tweet by Mat Velloso on the difference between machine learning and AI: if it is written in Python, it is probably machine learning; if it is written in PowerPoint, it is probably AI. What he is referring to is that the people who work in the field and do the hands-on work call it machine learning, while people at the business level, consultants, and people like that call it AI. You can almost tell from the vocabulary that gets used what the person's background is. Here's another interesting comment. If you came up through the traditional engineering schools, through electrical engineering, your introduction to this field probably came from this book by Duda & Hart. This goes all the way back to 1973. This is the other name of this field: it was called pattern classification, and the book is called Pattern Classification and Scene Analysis. People who come from an engineering background think of machine learning as a successor to, or even just a renaming of, what was pattern classification, and historically we would not have thought of it as part of AI, but this is just arguing about names. It's a good classic book; if you can get your hands on it, it's great to read, even 50 years later. Of the many kinds of machine learning there are, we will focus on supervised learning. This is the fundamental technique that we're dealing with. In supervised learning, we have an input x, perhaps images; we have a function f, perhaps a neural network; and we have an output y, in this case maybe a diagnosis. Let's walk through the steps. First, we have a training phase. Given training data pairs (x, y), images plus associated diagnoses perhaps, we learn a function f that takes x and maps it to y. f can be any function, but typically these days it's a deep neural network.
What we're learning here is f. When we say the system learns, it learns this function; it learns the parameters of a deep neural network. Then we have the application phase. In the future, we get a new x, let's call it x prime. We use the existing f, the one that we trained earlier, to estimate the new y, y prime, which is the application of the function to x prime. In the training phase, we learn f; in the application phase, we estimate y prime. When you design software that uses a machine learning technique, outside of the actual software you have a separate set of code that learns this function f. This function gets stored and incorporated into the software, which, in the application phase, simply applies the function f to estimate the new y. The hard part in all supervised learning is getting hold of the training data (x, y). This often requires experts to label the input data x to generate the target outputs y. For example, in the example I just gave of images plus diagnoses, you have to get a lot of images and have experts read through them and confirm the diagnoses, so you can build the training set that you can then use to train your function, which can be applied in the future. The two most common machine learning applications are classification and regression. In classification, given some input, we output the class it belongs to, for example healthy versus diseased, cancer versus non-cancer, something discrete. In regression, given some input, for example the time a student spent studying and their past grades, we predict a continuous value, perhaps a test score. One is categorical and the other is continuous, and while in textbooks these are treated differently, and there is some interesting mathematics that separates them, if you actually implement these things with deep neural networks, the only real difference is almost the last layer of the network; the rest may be almost the same. It's just a question of what the evaluation criterion is.
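To make the two phases concrete, here is a minimal sketch of supervised regression in Python. All of the data and names here are made up for illustration (and f is just a line fit by least squares, where in practice it would typically be a deep neural network):

```python
import numpy as np

# Training phase: given pairs (x, y), learn the parameters of f.
# Here f is just a line, y = a*x + b, fit by least squares to
# made-up data: hours studied (x) versus test score (y).
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([52.0, 61.0, 70.0, 79.0, 88.0])

A = np.vstack([x_train, np.ones_like(x_train)]).T
(a, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Application phase: the stored, trained f is applied to a new x'.
def f(x_new):
    return a * x_new + b

print(round(f(6.0), 1))  # predicted score y' for a new input x' = 6
```

The training code lives outside the shipped software; only the learned parameters a and b (our f) need to be stored and applied later.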
It's different for a discrete versus a continuous output, but other than that, the two things may be almost the same. Let's talk a little bit about classification as an example, and let's cast it in a slightly more traditional sense, from before the deep learning era. Given some measurements, which we'll call the descriptors or the features, we assign a label to these measurements. This is our goal, and it's a two-step process. Given a thing, we map it into measurements or features; this is the representation. For our purposes, if we have a person, we reduce that person to two measurements: their weight and their height. If we're trying to figure out whether this person is overweight or healthy, height and weight are actually a pretty good description of a person. Then, given these measurements, the features, we classify. We take weight and height and we output a label: healthy or overweight. It could get more complicated, but let's stick to this classification for our example. How does this work? Let's assume that we're back in, say, 1975, and let's think about how the machinery of this process would have worked. Here are the healthy subjects: we have a set of healthy subjects, so we have their heights and weights. Each point is a single subject with a particular height and weight, and some expert physician has classified each person. All the green ones are healthy; all the red ones are overweight, and you can see that they have high weight and low height, which is a typical signature of being overweight. Then what we're going to do is learn a line in this two-dimensional plot that separates the healthy on the top from the overweight on the bottom. If we bring in a new subject and they happen to land here, we classify them as overweight; if they happen to land there, we say they are healthy. That is the process.
We assume, and this is a fundamental aspect here, that an expert has already labeled the subjects as healthy or overweight, and this is the hard thing about machine learning. People talk of machine learning as magic, but the fundamental problem in machine learning isn't the math. The math is actually often fairly straightforward; the hard part is getting access to these data pairs, the inputs plus the labels. That is the hard thing. That is why all these internet companies you see around, the Googles and Facebooks of the world, are vacuuming up data like crazy: it is the data that actually defines the performance of these algorithms. Let's see how we do this in practice. Here is one of the simplest algorithms, and it comes from Duda & Hart, all the way back in the early 1970s. We compute the average for each class: this green circle is the average healthy person, the average height and weight of the healthy subjects, and this red circle is the average height and weight of the overweight subjects. Then we find the line that bisects the line connecting the averages. To put it graphically, you connect the two averages and find the line that cuts this connecting segment in two. Anything below that line is overweight; anything above it is healthy. Now let's take this algorithm and apply it, so we're in the application phase. All we have is the line. This is what we have learned; this is our f from the previous discussion. Now here comes a new subject. They land there; based on the algorithm, they are above the line, so we classify them as healthy. Here's another subject; they are below the line, so we classify them as overweight. The third subject is healthy; the fourth subject is overweight. That is essentially the process of using machine learning. You train based on data, you learn a function, you keep the function, and then you apply it in the future. That is the concept here.
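As a sketch of this nearest-mean idea, here is a minimal Python version. The height (m) and weight (kg) samples are made up purely for illustration:

```python
import numpy as np

# Made-up labeled training data: (height in m, weight in kg).
healthy    = np.array([[1.80, 70.0], [1.75, 68.0], [1.70, 65.0]])
overweight = np.array([[1.65, 95.0], [1.70, 100.0], [1.60, 90.0]])

# Training phase: all we learn (and keep) are the two class means.
mean_h = healthy.mean(axis=0)
mean_o = overweight.mean(axis=0)

def classify(subject):
    """Application phase: assign a new subject to the nearer class mean.
    With Euclidean distance, the implied decision boundary is exactly
    the perpendicular bisector of the segment joining the two means."""
    d_h = np.linalg.norm(subject - mean_h)
    d_o = np.linalg.norm(subject - mean_o)
    return "healthy" if d_h < d_o else "overweight"

print(classify(np.array([1.78, 72.0])))   # -> healthy
print(classify(np.array([1.62, 98.0])))   # -> overweight
```

Note that with raw units the weight axis dominates the distance; in a real system the features would be normalized first, though for this toy data it does not change the labels.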
One interesting aspect, and this is very similar to what we saw earlier when we talked about parametric versus non-parametric estimation of probability density functions, is the distinction between parametric and non-parametric classifiers. In the previous example, we just learned a line; that's a parametric function, just a line with two parameters. You can also learn non-parametric functions, which learn a lot more from the data and are implicitly defined, as opposed to explicitly defined like a parametric function. Let's take a look at an example. Before, just to recap, we assumed a form for the boundary, a line, and we learned the parameters of the line. But what if we don't know what the shape of the boundary should be? Sometimes we can't tell. How do we handle that? The oldest method in the book is nearest neighbor. Here's a set of data; we keep it all in memory. When a new subject comes in, we ask the question: is the most similar sample of those we have in memory healthy or overweight? We use that to classify. If you think about it, this is how we often do many tasks in life, and this is what doctors often do when they diagnose. You look at a patient and ask, what patient does this remind me of? You find the most similar patient in your long memory of seeing previous patients and say, well, that was the diagnosis then, so chances are this person is going to be a similar case. Let's formalize it a little bit and show what the decision boundary looks like. This is nearest neighbor: we have the samples, the reds are the overweight, the greens are the healthy. With this separation, we end up with a boundary like this: on top of it you're healthy, below it you're overweight. It's an irregular, curved boundary. Now what if we have one more sample in the database? That person up there, despite appearances, is actually overweight, even though they appear to lie in the healthy region. We don't know why; maybe it's something genetic about them.
Now our boundary becomes a discontinuous function. Below the original boundary you're overweight, and up here around the new sample you're also overweight; in the middle, you're healthy, if we continue with the previous analogy. When it comes to neural networks, which are far more complicated, we can end up with many islands: an island here, an island here, an island here. We divide this area into almost a checkerboard-type pattern, with the two classes, healthy and overweight, interspersed. An extension of this technique is k-nearest neighbors. Instead of looking at the single nearest neighbor, we look at the nearest k, where k could be 3, 5, 7, and we take a majority vote. If, of your five nearest neighbors, four were healthy, you classify the new subject as healthy, and so on and so forth. To conclude, let's raise some problems. All this sounds very good so far, but we almost never have enough samples, especially when we're dealing with high dimensions. So far we've talked about height and weight, a two-dimensional description of a person. Think of it as a grid: if I want 10 points along the rows and 10 points along the columns, I need maybe 100 subjects to fill in the grid. That's a reasonable number. Now, if I'm doing a higher-level classification where I have eight features, all of a sudden this blows up to 10^8, and all of a sudden I need millions of people to fill in my space. Very quickly we run out of the ability to have enough samples as the dimensionality grows. The other big problem when learning is that we can over-learn from our samples. What if one subject, or a small number of subjects, is atypical? Think about k-nearest neighbor with a bad subject in there, somebody who was misclassified perhaps. Now that person's error is going to propagate through your results. The training data can be mislabeled, which is the previous problem; there could be simple transcription errors; and the dimensionality can just be too high.
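The k-nearest-neighbors idea above can be sketched in a few lines; the (height, weight) samples below are made up for illustration:

```python
import numpy as np
from collections import Counter

# "Training" in k-NN is just storing all labeled samples in memory.
# Made-up data: (height in m, weight in kg); first four are healthy.
samples = np.array([[1.80, 70.0], [1.75, 68.0], [1.70, 65.0],
                    [1.82, 74.0], [1.65, 95.0], [1.70, 100.0],
                    [1.60, 90.0], [1.68, 93.0]])
labels = ["healthy"] * 4 + ["overweight"] * 4

def knn_classify(subject, k=3):
    """Find the k stored samples closest to the new subject and
    take a majority vote over their labels."""
    dists = np.linalg.norm(samples - subject, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify(np.array([1.77, 69.0]), k=3))   # -> healthy
print(knn_classify(np.array([1.66, 94.0]), k=5))   # -> overweight
```

Notice how a single mislabeled stored sample would directly flip votes in its neighborhood, which is exactly the over-learning risk just described.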
People like to use lots and lots of features; it feels better to have a richer description. But if the dimensionality is too high, that can get you in trouble. The correct answer, and this is the fundamental problem with machine learning, is only known at the locations of our samples. We have no general truth. We only know the answer at those, say, 20 locations where we have training data; we're trying to interpolate in the empty space in between, but we have no idea what that function really is. To give you a slightly different view of the topic, we have another clip from Christian Kastner from Carnegie Mellon University, who talks about this issue. In machine learning, we have example data; we feed in the example data and see how often we get the expected output. We measure accuracy, or the many other metrics in this area, precision and so on; we typically get some percentage. We see how close we get to what we want. We don't usually know what to expect for arbitrary inputs, so the way we write test cases is not the same as in traditional software. Consider how we pick examples: we typically select a representative sample of data to test on; we're not really looking at corner cases as we might do in software engineering. If one prediction is wrong, that's not the end of the world. We accept it; it's just lower accuracy. What we're looking for is not really correctness of the system, because we don't really know what correct means. Traditionally, correct would mean all the answers are right according to some standard, some specification. Instead, we're looking for fit. The nice saying here is that all models are wrong, but some models are useful. What we're looking for is a model that's good enough to be useful in practice, that works most of the time or in a sufficient number of cases. There might be some hard problems where 10 percent accuracy is actually much better than what we had before.
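Measuring fit rather than correctness can be sketched very simply; the predictions and expert labels below are hypothetical:

```python
# In ML evaluation we don't assert correctness on every input; we
# measure how often predictions match the labels on held-out test data.
# Hypothetical model predictions and expert labels for ten subjects.
predictions = ["healthy", "overweight", "healthy", "healthy", "overweight",
               "healthy", "overweight", "healthy", "healthy", "overweight"]
labels      = ["healthy", "overweight", "healthy", "overweight", "overweight",
               "healthy", "overweight", "healthy", "healthy", "healthy"]

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 8 of 10 predictions match the labels, so 0.8
```

Whether any particular number is good enough depends entirely on the application.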
And there are some problems where 99.9 percent accuracy is not acceptable, because it's not useful for the problem. That is the actual point here. When we resort to machine learning techniques, it is often because we have no other way of getting at an answer. We don't understand the problem well enough to specify it, so we learn from examples, and we have to accept all the issues that come with that. We hope to find something that's good, that works most of the time, and depending on how serious the application is, that bar goes up and up. Then, as we'll see towards the end of this week's segments, we build safeguards around it, so that when things go wrong, we can detect them and pull in the human expert to make corrections. With that, we'll stop this segment. In the next one, we'll discuss the issues of how these things are tested and some of the problems that occur in these very data-driven algorithms. Thank you.