In the previous example, we were using a clear, unambiguous image for a conversion. Sometimes there will be noise in images you want to OCR, making it difficult to extract the text. Luckily, there are techniques we can use to increase the efficacy of OCR with pytesseract and pillow. Let's use a different image this time, with the same text as before but with added noise in the picture. We can view this image using the following code. So from PIL we'll import Image, pretty common for us now. Then we'll do an Image.open, and we'll pull out this Noisy_OCR.PNG, and then we'll use the display function in Jupyter to display it in-line. As you can see, this image has shapes of different opacities behind the text, which can confuse the tesseract engine. Let's see if OCR will work on this noisy image. So import pytesseract, then we'll call pytesseract.image_to_string, and we'll just pass in the image that we're going to open, this Noisy_OCR. And then let's print out the text directly. The result is mostly garbage, which is a bit surprising given how nicely pytesseract worked previously. Let's experiment on the image using techniques that will allow for more effective image analysis.

First up, let's change the size of the image. So first we're going to import PIL. Then we set the base width of our image; we'll set it to 600 pixels. Now let's open the image, which is old hat for us now, and we'll assign it to img. We want to keep the correct aspect ratio, and we can do this by taking the base width and dividing it by the actual width of the image. So I'm going to create a new variable called wpercent and make this equal to the base width divided by img.size sub zero, which is the width value there. With that ratio, we can get the appropriate height of the image too. So I'll make something called hsize and set it to img.size sub one, and we'll multiply that by the percentage, so we're just scaling here. Finally, let's resize the image.
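The resize logic just described can be sketched like this. I'm generating a blank stand-in image so the snippet is self-contained, since Noisy_OCR.PNG only exists in the course environment; note that newer pillow releases renamed Image.ANTIALIAS to Image.LANCZOS, which is the same high-quality filter.

```python
from PIL import Image

# Stand-in for Image.open('Noisy_OCR.PNG') so this sketch runs on its own
img = Image.new('RGB', (1200, 800), 'white')

basewidth = 600                             # target width in pixels
wpercent = basewidth / float(img.size[0])   # scaling ratio from the width
hsize = int(float(img.size[1]) * wpercent)  # scale the height by the same ratio

# Image.ANTIALIAS from the lecture is the deprecated name for Image.LANCZOS
img = img.resize((basewidth, hsize), Image.LANCZOS)
print(img.size)  # (600, 400) -- the aspect ratio is preserved
```

Because both dimensions are scaled by the same wpercent ratio, the text in the image isn't stretched or squashed.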
Antialiasing is a specific way of resampling lines to try and make them appear smooth. So here I'll just call img.resize, and I pass it a tuple, which is the base width and the height size. And then I use this PIL.Image.ANTIALIAS to really just create better lines. Now let's save this to a file, so I'll call img.save('resized_nois.png'); you could call it whatever you'd like. And finally, let's display it in-line, so I'll call display. And then let's run OCR. So again, pytesseract.image_to_string, and I'm going to open this new image underneath. I guess I could've just passed an image here. And then print the text. So resizing the image hasn't actually given us any improvement. And this is sometimes how it goes when you're experimenting and trying to get things like this to work.

Let's convert the image to grayscale. Converting images can be done in many different ways. If we poke around in the pillow documentation, we'll find that one of the easiest ways to do this is with the convert function, where we pass in the string a capital L. So let's open the image that we're working with, and then let's call img.convert and pass in a capital L. Now let's save that image; I'm going to call it grayscale_noise.jpg here. Remember, PIL always worries about the file format for you based on the name of the image, so ending in .jpg here versus ending in .png is fine. And then let's run OCR on the grayscale image. To sort of prove there's no shenanigans, I'll open that grayscale image that we saved, pass it to image_to_string in pytesseract, and print out the text. Wow, that worked really well.

So if we look at the help documentation using the help function, as in help(img.convert), we see that the conversion mechanism used is the ITU-R 601-2 luma transform. There's more information about this out there, but this method essentially takes a three channel image, where there's information for the amount of red, green, and blue, or R, G, and B.
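A minimal sketch of that grayscale conversion, using a tiny synthetic image in place of the noisy one; the pytesseract call is left out here since it needs the tesseract binary installed.

```python
from PIL import Image

# A tiny synthetic RGB image stands in for the noisy OCR image
img = Image.new('RGB', (2, 1))
img.putpixel((0, 0), (255, 255, 255))  # white
img.putpixel((1, 0), (0, 0, 0))        # black

# convert('L') applies the ITU-R 601-2 luma transform:
# L = R * 299/1000 + G * 587/1000 + B * 114/1000
gray = img.convert('L')
print(gray.mode)              # 'L' -- a single luminosity channel
print(gray.getpixel((0, 0)))  # 255: white maps to full brightness
print(gray.getpixel((1, 0)))  # 0: black maps to zero brightness
```

Saving with gray.save('grayscale_noise.jpg') would then pick the JPEG format from the extension, exactly as described above.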
And reduces it to a single channel to represent luminosity, and that's what the L is for. This method actually comes from how standard definition television sets encoded color on top of black and white images. If you get really interested in image manipulation and recognition, learning about color spaces and how we represent color, both computationally and through human perception, is a really interesting field.

Even though we now have the complete text of the image, there are a few other techniques we could use to help improve OCR detection in the event that the above two don't help. The next approach I would use is one called binarization, which means to separate into two distinct parts, in this case, black and white. Binarization is enacted through a process called thresholding. If a pixel value is greater than the threshold value, it'll be converted to a white pixel; if it is lower than the threshold value, it'll be converted to a black pixel. This process eliminates noise in the OCR process, allowing greater image recognition accuracy. With pillow, this process is straightforward. So let's open the noisy image and convert it using binarization. Here we just call Image.open to read our noisy image in, and then we call convert and pass in the character 1. Note that we're passing it as a character, not as a number, so this is a string value we're passing in. Now let's save and display that image, so img.save, we'll call it black_and_white noise.jpg, and display. You can see here the image looks kind of dotted and mottled, with various different patterns in it, but definitely this is a black and white image. So that was a bit magical, and it really required a fine reading of the docs to figure out that the character 1 is the special string parameter to the convert function that actually does the binarization. But you actually have all the skills you need to write this function yourself. Let's walk through an example.
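Here's a small self-contained sketch of that one-bit conversion, with a uniform gray image standing in for the noisy picture. The dotted, mottled look comes from the Floyd-Steinberg dithering that pillow applies by default when converting to mode '1'.

```python
from PIL import Image

# A mid-gray image stands in for the noisy picture
img = Image.new('L', (4, 4), 128)

# Passing the string '1' asks pillow for a one-bit, black-and-white image;
# by default the conversion dithers, which produces the dotted pattern
bw = img.convert('1')
print(bw.mode)                        # '1'
print(set(bw.getdata()) <= {0, 255})  # True -- only pure black and white remain
```

If you wanted a plain cutoff with no dot pattern instead, pillow's convert also accepts a dither argument to disable the dithering.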
First, let's define a function called binarize, which takes in an image and a threshold value. So I'll def binarize with an image_to_transform, and then some threshold value. Now, let's convert the image to a single channel grayscale image using convert. So here we just create some new output image, which is what we'll end up returning, and we transform the image passed in from color to luminosity values only. There's nothing new or magical to be done here; this is just creating a grayscale image. The threshold value is usually provided as a number between 0 and 255, which is the largest value that fits in a single byte. The algorithm for the binarization is pretty simple: go through every pixel in the image, and if it's greater than the threshold, turn it all the way up, to 255. And if it's lower than the threshold, turn it all the way down, to 0. So let's write this in code.

First, we need to iterate over all the pixels in the image. So for x in range, and we'll just go over the width, so values along the x axis. And then for y in range, we'll go through the height, so these will be our values along the y axis. For a given pixel at some width and height, let's check its value against the threshold. We can do this with if output_image.getpixel, and we'll just pull the pixel at x and y. You'll note lots of brackets here; that's because we're actually passing a tuple value in. We just check to see if it's less than the threshold value, and set it to 0 if it is. So in our output image, we just putpixel, passing in the same x, y, and we set it to 0. So we're just changing it to 0 if it's less than the threshold. Otherwise we want to set it to 255, so output_image.putpixel( (x,y), 255 ). And now we just return the new image.

So let's test this function over a range of different thresholds. Remember that you can use the range function to generate a list of numbers at different step sizes. Range is called with a start, a stop, and a step size.
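Putting the walkthrough together, the function looks like this; the three-pixel test image at the bottom is just my own stand-in for the noisy picture so the sketch runs on its own.

```python
from PIL import Image

def binarize(image_to_transform, threshold):
    # Reduce to a single luminosity channel first -- just a grayscale image
    output_image = image_to_transform.convert('L')
    # Walk every pixel: below the threshold goes to black (0),
    # everything else goes to white (255)
    for x in range(output_image.width):
        for y in range(output_image.height):
            if output_image.getpixel((x, y)) < threshold:
                output_image.putpixel((x, y), 0)
            else:
                output_image.putpixel((x, y), 255)
    return output_image

# Quick check on a tiny gradient instead of the lecture's noisy image
img = Image.new('L', (3, 1))
img.putpixel((0, 0), 10)
img.putpixel((1, 0), 128)
img.putpixel((2, 0), 250)
result = binarize(img, 100)
print(list(result.getdata()))  # [0, 255, 255]
```

The double parentheses in getpixel((x, y)) and putpixel((x, y), 255) are the tuples mentioned above: the coordinate pair is a single argument.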
So let's try range(0, 257, 64), which should generate five images at different threshold values. So for thresh in range 0 to 257, stepping by 64. Let's print out a string to tell us what threshold we're trying here. Remember, the thresh value is an integer, so we'll change it to a string using the str function. And then let's display the binarized image in-line. The way we do this is with the display function, and then we call our function, binarize, passing it the Image.open of read_only/Noisy_OCR. We could of course cache this, opening it once and passing it around as a parameter, but it's okay for our demonstration to do it this way. And then we'll send in the threshold value, which will be 0 the first time, 64 the second time, and so forth. And let's use tesseract on it. It's inefficient to binarize it twice, but this is really just for a demo. So here we'll call print pytesseract.image_to_string, passing in a call to binarize, which itself passes in a call to Image.open. So there are a lot of Image.opens here, and lots of room for this code to be improved, but it should generate an example for us.

You can see the result with threshold 0 is pretty empty. With threshold 64 we actually get a very faint looking image, but it seems like we get all of, or most of, the text. When we increase the threshold from 128 to 192, we actually pick up a new space between the words of and this, so we're getting more definition in the text. But as the threshold climbs, a whole segment of the image becomes black and we start to lose text, and at the very top end threshold of 256 we get nothing at all, because the whole image is black at that point. We can see from this that a threshold of 0 essentially turns everything white, and that the text becomes more bold as we move towards a higher threshold.
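The sweep can be sketched like this. I've swapped the pixel-by-pixel loop for pillow's point method, which applies a function to every pixel and behaves the same way, and I've dropped the display and pytesseract calls since those need Jupyter and the tesseract binary; the two-pixel image is my own stand-in for the noisy picture.

```python
from PIL import Image

def binarize(image_to_transform, threshold):
    # Same behavior as the explicit loop: below the threshold -> 0 (black),
    # at or above it -> 255 (white), via a per-pixel lookup
    return image_to_transform.convert('L').point(
        lambda p: 255 if p >= threshold else 0)

img = Image.new('L', (2, 1))
img.putpixel((0, 0), 50)   # a darker pixel
img.putpixel((1, 0), 200)  # a lighter pixel

# The lecture's sweep: thresholds 0, 64, 128, 192, 256
for thresh in range(0, 257, 64):
    result = binarize(img, thresh)
    print('trying threshold: ' + str(thresh), list(result.getdata()))
```

At threshold 0 both pixels go white, matching the empty image described above, and at 256 both go black, which is why the last image in the sweep shows nothing.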
And the shapes, which have a filled-in gray color, become more evident at higher thresholds. In the next lecture, we'll look a bit more at some of the challenges you can expect when doing OCR on real data.