[MUSIC] Up until now, we've only been talking about image classification, where we want to assign a single label to the main content of an image. But what happens if we have multiple objects of interest in one image? For example, let's say we trained a classifier to identify an image as either a cat or a dog. Then I feed this to the model. Is this a picture of a cat or a dog? The model is incapable of handling two objects, so it will focus on the most salient features, based on how it's been trained, to make a decision. It will classify this image, which contains both a cat and a dog, as either cat or dog, but not both.

Remember the model we trained earlier to identify electronic components? I gave it these two images as test data to see what it would do. Both images contain a resistor and a capacitor. In the first image, it still successfully classified the resistor, but it was not very sure of that choice, even though the resistor was in the position where the model might expect to find resistors: right in the center of the frame with the leads going out to either side. When I moved the camera down slightly, the resistor was no longer in that spot. This time the model focused on the capacitor and classified the image as such. Note that it won't always be what's in the center of the image; what the model focuses on has to do with how you train it. If the resistors were always in the top left of the frame in the training data, then it might focus there when trying to classify something as a resistor. But this model happens to focus on what's near the center, as that's how all the images in the training set were structured.

What we've been doing up until now is known as image classification. Image classification is where we train a model to predict the class of an entire image. This works best when there is a single, lone subject belonging to one of our classes. When you introduce multiple objects into the image, the classification method starts to break down. Image classification also can't tell you exactly where in the image the object is located, so it's best if all the objects are framed in a similar manner across all the images you're working with.

However, we can use other methods to find the location of objects. This is known as object localization. Note that with localization, we don't care about the type of object being detected; we just want to know where some kind of object of interest is located. That's where we introduce object detection. With object detection, we want to both locate something in the frame and classify it. Object detection can tell us where multiple instances of an object are in an image, as well as discern among several classes of objects.

Most object detection methods will attempt to both classify objects in an image and give us an idea of their location and relative size in the frame. They do this by using what's known as a bounding box. A bounding box is a set of coordinates in the image; it could be an x, y, width, and height, or the locations of two opposite corners. The idea is that you could use this information to draw an overlay on the original image to get an idea of where those objects are located in the frame. Along with the bounding box, you should also get a predicted class and a confidence level for that class. For example, a well-trained object detection model might give us a bounding box for the cat with a predicted confidence of 98%, as well as a bounding box for the dog with a predicted confidence of 83% for the dog class.
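To make the two bounding box formats concrete, here is a minimal Python sketch showing how you might convert between them and find the middle of a box. The field names and example values are made up purely for illustration.

```python
# A bounding box given as (x, y, width, height), where (x, y) is the top-left corner.
# The values here are invented for this example.
box_xywh = (32, 48, 60, 40)

def xywh_to_corners(x, y, w, h):
    """Convert (x, y, width, height) into two opposite corners (x1, y1, x2, y2)."""
    return (x, y, x + w, y + h)

def box_center(x, y, w, h):
    """Find the middle of the object from its bounding box."""
    return (x + w / 2, y + h / 2)

print(xywh_to_corners(*box_xywh))  # (32, 48, 92, 88)
print(box_center(*box_xywh))       # (62.0, 68.0)
```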
As that sketch suggests, with a little bit of math you can find the center of these bounding boxes, which should give you an idea of where the middle of the object is. This is great if, say, you want a self-driving car to avoid these objects, or maybe you want a pan-tilt camera to automatically follow them by moving so that the center of the frame lines up with the center of one of the objects.

One easy way to do object detection is a simple sliding window, with inference performed for each window. Let's say I'm trying to determine the location of my dog in this frame. This works best if the window is the same size as your training set resolution. We then feed whatever is under this window to our image classification model, and it performs inference to give us a prediction. Here it might predict background as the class. Then we slide the window over and send the new set of pixels under the window to the model. Note that I have a little bit of overlap between the two windows. You can adjust the amount of overlap by setting a stride value, usually measured in number of pixels. If your stride is larger than your window width, you won't have any overlap; if it's less, you'll have some overlap. More overlap means a better chance of detecting your object, but it also means that you'll need to perform more inferences per still image, which could really slow down your frame rate.

We continue this process until the window highlights my dog here. Here the model should predict dog as the class. From here we could draw a bounding box the size of the window to denote that a dog was detected. However, let's say that we keep sliding the window and marking the locations of dog detections. We might end up with something like this: three true positives that detected areas of my dog, one false positive where the model thought my hanging cables were a dog, and one false negative where my dog's back legs were missed. This might be good enough for your project, as you can still successfully see that a dog was present in the image. Knowing that the cables caused a false positive hit, we might go back with extra training data to get a better model.

Let's say we do that and end up with something like this. We still see three instances of the dog class. However, with some math, we can determine that bounding boxes of the same class are overlapping and assume that they must be one large object. We can then find the extent of these bounding boxes to create a larger, more inclusive bounding box. You could average the individual classifications together to get some kind of total confidence score for the detected object. (I'll show a rough code sketch of this scan-and-merge process in a moment.) However, this might not work if you had multiple instances of dog next to each other; this method might think overlapping objects are a single object. As you can see, this is a basic approach to doing object detection, but it has a lot of problems. We'll examine more advanced techniques for handling overlapping objects later.

Something else you can do is resize the image under the window to the expected resolution of the model's input. For example, maybe we take a bigger section that's 96 by 96 and resize it to 48 by 48 before performing inference. This would keep inference fast, and we could scan larger sections of the image. It also might allow us to look for dog objects that are closer to the camera. However, it might miss dog objects that are farther away, as the model was not trained on images at this scale, even though it might allow us to highlight the dog without computing the extent of several bounding boxes.
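Here is the rough Python sketch of that scan-and-merge process. It assumes we already have a classify() function that takes a crop and returns a label and a confidence; the window size, stride, and function names are all made up for illustration, not taken from any particular library.

```python
import numpy as np

WINDOW = 48  # window size matches the training resolution (48x48 pixels here)
STRIDE = 32  # a stride smaller than the window width gives some overlap

def sliding_window_detect(image, classify):
    """Slide a window across the image, classify each crop, and keep the 'dog' hits."""
    detections = []
    height, width = image.shape[:2]
    for y in range(0, height - WINDOW + 1, STRIDE):
        for x in range(0, width - WINDOW + 1, STRIDE):
            crop = image[y:y + WINDOW, x:x + WINDOW]
            label, confidence = classify(crop)
            if label == "dog":
                detections.append((x, y, WINDOW, WINDOW, confidence))
    return detections

def merge_detections(detections):
    """Merge the detections into one box covering their full extent, averaging the
    confidences. (For simplicity this merges every box; a real version would first
    check which boxes actually overlap, and it would still lump overlapping
    objects together as a single object.)"""
    if not detections:
        return None
    x1 = min(d[0] for d in detections)
    y1 = min(d[1] for d in detections)
    x2 = max(d[0] + d[2] for d in detections)
    y2 = max(d[1] + d[3] for d in detections)
    confidence = float(np.mean([d[4] for d in detections]))
    return (x1, y1, x2 - x1, y2 - y1, confidence)
```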
Either way, we run the risk of missing the dog altogether if the stride is too high, the dog is too far away, or the model is not picking out the correct features due to the resizing operation. Even with a small overlap, as shown here, we'd still need to perform 48 inferences per still image, and I'd still need to drop the right and bottom edges unless I did some sort of padding or used a smaller-than-normal stride for the final windows along the edges. Let's say we have a fairly small model that can perform inference in 100 milliseconds and we don't need to resize our windows. We would still need to spend around 4.8 seconds per image to locate our object. That would give us about 0.208 frames per second if we wanted to do this on a live video feed. Your project might be fine with such a low frame rate, such as if you just needed to check for the presence of a person or animal every minute or so.

However, this assumes that we can find what we're looking for in the frame using a square bounding box. Let's say we're working on a self-driving car that needs to watch for pedestrians and other cars. These square search boxes would only be capable of covering a portion of the object in question. Even if the classifier identified each piece as belonging to a person or car, we'd still have to combine them in some way to denote something as just one object rather than several. That can be a tricky problem.

One option is to scan the image with a variety of window sizes ranging from small to large. While this might help spot single instances better, you run into two issues. First, it's hard to determine the exact location of tall or flat objects. For example, is the identified person in the middle, left, or right of the bounding box shown? You would not be able to tell if you just got the coordinates and size of the bounding box. Second, many more inferences need to be performed to scan the image with different window sizes.

Taking this a step further, what if we scanned the image with a variety of bounding box shapes and sizes? I've removed the potential boxes for the car, as they only cluttered the image. Here, each location (or every few pixels) could be scanned with six or more differently shaped windows. The area under the window could be resized and skewed as necessary to match the input dimensions of the convolutional neural network. The network would be trained to know how to work with scaled images like this. From here, we could select the bounding box with the highest score for that object, and ideally we would end up with something like this to predict the location of our pedestrian. However, as you probably guessed, this is incredibly time-intensive for both training and inference.

Instead of sliding many windows over the same image, forcing us to perform inference hundreds or even thousands of times, what if we used another algorithm to propose possible windows for us? This might be a clustering algorithm that groups similar pixels together, or even another convolutional neural network that picks out what it thinks are interesting areas in the input image. Note that these areas may or may not contain objects; they're just proposed windows that contain something interesting. We then send these windows to our convolutional neural network for classification. This process of identifying potential regions of interest is known as region proposal, and using a convolutional neural network to perform region proposal is known as a region proposal network. The proposed areas are known as regions of interest.
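To make that idea concrete, here is a rough Python sketch, assuming we have some propose_regions() function (a clustering step or a small region proposal network) that returns candidate boxes, and a classify() function for our image classifier. The function names, the 48-by-48 input size, and the confidence threshold are all placeholders, and OpenCV is used only to resize the crops.

```python
import cv2  # used only to resize each proposed region to the classifier's input size

def detect_with_proposals(image, propose_regions, classify, min_confidence=0.6):
    """Classify only the regions suggested by a proposal step, instead of sliding
    a window over every position in the image."""
    detections = []
    for (x, y, w, h) in propose_regions(image):               # candidate regions of interest
        crop = cv2.resize(image[y:y + h, x:x + w], (48, 48))  # match the model's input size
        label, confidence = classify(crop)
        if label != "background" and confidence >= min_confidence:
            detections.append({"label": label,
                               "confidence": confidence,
                               "box": (x, y, w, h)})
    return detections
```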
Hopefully our classifier, which we trained to identify vehicles and people, would be able to correctly identify the cars and pedestrians in this image. We would combine or pick the best window for our bounding boxes and end up with something like this. This saves us a lot of time over the sliding window method for identifying objects in an image. Region proposal as a way to feed an image classifier forms the basis for the region-based convolutional neural network model, which we'll look at in a future lecture.

Even though object detection methods may have many different parts, we can treat them as a single machine learning model. You can often train some or all of its parts during the training step. The ultimate goal of these object detection models is to locate and classify all of the objects of interest in the photo. For example, let's say I train this model to identify my dog, his tug toy, and his ball. The output of such a model will often look something like this: you'll get a list of zero or more objects, where each object contains the predicted label and information about the bounding box. If you're working with Python, this might be a list of Python objects you'll have to loop through, or it could be something like a JSON string (see the sketch at the end of this section). The x, y location might be the center of the box, or it could be the top left; you'll have to read the documentation for whichever model you're using to determine that. The class prediction will likely come with a probability score from the softmax output.

With this information, we can then draw the predicted boxes on the image. We could also use the x and y information to perform some task, like shine a laser for a cat to play with or move a camera on a servo to face a particular object. The width and height of the bounding box might also provide some insight into how close an object is to the camera, assuming we have some information about the size of that object.

There are a number of popular models that perform object detection like this. These models can be very complicated, so their inner working details are outside the scope of this course. However, in a future lecture we'll briefly go over a few of the models and demonstrate how to use one of them in Edge Impulse. I will also make sure to list some articles in the recommended reading sections if you'd like to dig more into these concepts and models.
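Finally, here is the small Python sketch mentioned above: it loops through a hypothetical list of detections, parsed from a JSON string, and computes the center of each bounding box. The field names and the assumption that x, y is the top-left corner are invented for this example; the actual format depends on the model you're using.

```python
import json

# Hypothetical output from an object detection model; real field names and
# coordinate conventions vary, so check your model's documentation.
results_json = """
[
  {"label": "dog",     "confidence": 0.91, "x": 120, "y": 80,  "width": 200, "height": 160},
  {"label": "tug_toy", "confidence": 0.77, "x": 340, "y": 210, "width": 60,  "height": 40}
]
"""

detections = json.loads(results_json)

for det in detections:
    # Assuming (x, y) is the top-left corner, find the middle of the object.
    center_x = det["x"] + det["width"] / 2
    center_y = det["y"] + det["height"] / 2
    print(f"{det['label']} ({det['confidence']:.0%}) centered at ({center_x}, {center_y})")
```

[MUSIC]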