[MUSIC] Artificial neural networks, or neural networks for short, are ancient tools, at least on the timeline of AI methods, but lately they've been in the news. We introduced them way back in course 1, since deep learning is all around. And indeed, due to increased computational power, the availability of data, and advancements in deep learning, neural networks have been incredibly strong in natural language processing, computer vision, and robotics. In this video we're going to look at neural networks, what they are and what applications they're used for. We are not going to get into how to use them, because that involves quite a bit of engineering. For that we recommend taking one of Coursera's courses specifically about deep learning. Now, although this course is about supervised learning, neural networks are by no means limited to being used only for classification and regression. In fact, neural networks can also be found in unsupervised and reinforcement learning. They're really a sophisticated technique for function approximation, which is finding complex functions to describe the patterns you care about. Deep learning algorithms start with a huge hypothesis space. That's why they're extra useful when you have lots and lots of data, but they can have issues with overfitting. Okay, let's focus on a classification neural network. Just like always, it's a question-answering machine that takes in examples of operational data and answers with a category. So no matter how complicated the neural network is or how it's found, we give it an example and it tells us the category. We start with some feature matrix X and output some label array Y. But how do we get there? Neural networks were inspired by biological neurons, and so we call their basic component an artificial neuron, or just neuron for simplicity. What is a neuron? Well, to us it's a mathematical operator that takes some numbers as inputs and gives a number as its output.
Yep, that's a function, just like the ones we're usually talking about. What makes neurons special is that we interconnect them. So a neural network is a number of interconnected neurons, and the default way to connect them is in layers, so that the output of one batch of neurons serves as the input to another. The very first input is the example it's classifying, and so we call that layer the input layer. The input layer is then fed into the next layer, which is some set of neurons, which, yes, are just functions. So the only way this isn't simple regression is that there's a bunch of them, all giving different answers. This is the first hidden layer. It's got some numbers in there that mean something, probably, but really all that matters is they're the output of some set of functions. Now, we know that at the end, the output of our last layer has to be a category, or at least one transfer function away from a category. Not surprisingly, we call the last layer the output layer, and its output is our final answer: the category that the neural network question-answering machine predicts. In between the output and the input layers can be any number of hidden layers, and in the case of recurrent neural networks, they can even loop back on themselves. But the default is a feed-forward neural network, which feeds each hidden layer with the output of the layer before, starting with the input data and ending with the output layer proposing a category. When we have multiple hidden layers, the neural network is called a deep neural network, and the process of learning it is deep learning. It's the introduction of techniques that can handle the optimization of multiple hidden layers that led to a revival of interest in neural networks and a lot of our modern successes. So let's take a closer look at our neurons, which so far I've described as, yes, perfectly ordinary functions.
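To make the layered, feed-forward idea concrete, here's a minimal sketch in plain Python. All the weights, sizes, and names here are made up for illustration; real networks would learn the weights and use a library:

```python
import math

def sigmoid(z):
    # One common activation (transfer) function.
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, bias, inputs, activation):
    # Linear step: weighted sum of the inputs plus a bias term,
    # then the nonlinear activation.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)

def forward(layers, example):
    # Each layer is a list of (weights, bias) pairs. The output of one
    # layer becomes the input to the next -- a feed-forward pass.
    values = example
    for layer in layers:
        values = [neuron_output(w, b, values, sigmoid) for w, b in layer]
    return values

# A tiny made-up network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output = [([1.0, -1.0], 0.0)]
out = forward([hidden, output], [1.0, 2.0])
print(out)
```

The example's feature values flow in at the input layer, each hidden neuron turns them into a new number, and the output layer turns those numbers into the final answer.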
You can imagine how, if I left it completely open-ended, we'd end up with a ridiculously large hypothesis space, and that is true. But we are going to add some structure: typically each neuron actually involves two separate simple functions, a linear function, then a nonlinear transformation. You might have a sense of why, given all we discussed in module two. Linear functions are really easy to calculate, and they're differentiable and convex. What's not to love? Well, yeah, they're simple, but not everything we care about fits on a line. Our hypothesis space is often too small when we only look at linear functions, hence the nonlinear step. It's sometimes even called the transfer function, although activation function is more common. So within each neuron in our network, the first, linear function does what all linear functions do: it multiplies its weights against the input values and returns the sum. Then that value is passed to the activation function, and out comes some other number. We can't get into specifics here; it's a really long conversation, and it likely requires extensive experimentation to find the right one. But some activation functions you might hear of are sigmoids, hyperbolic tangents, and rectified linear units, or ReLU. And by the way, nobody said each layer has to use the same activation function, let alone have the same number of neurons. But regardless, the output of the activation function is the output of that neuron, and the output of one neuron is the input to a neuron in the next deeper layer, until finally the neurons of the output layer feed their answers into the final neuron. It applies its linear function and then its activation function, whatever that might be, and out comes the final answer. That's how it works in the end. But how does it get there? Where do these functions come from? In other words, how is a neural network learned? The weights of those linear functions, that's what a neural network is learning. And how are they learned?
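The two-step structure of a neuron, linear then nonlinear, can be sketched directly, along with the three activation functions just mentioned. The weights and inputs below are arbitrary numbers chosen only to show the mechanics:

```python
import math

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return max(0.0, z)

def tanh(z):
    # Hyperbolic tangent: squashes to the range (-1, 1).
    return math.tanh(z)

def sigmoid(z):
    # Sigmoid: squashes to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs, activation):
    # Step 1 (linear): multiply weights against the inputs and sum.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2 (nonlinear): pass the sum through the activation function.
    return activation(z)

# Same linear part, three different activation functions.
w, b, x = [2.0, -1.0], 0.5, [1.0, 3.0]   # linear sum z = 2 - 3 + 0.5 = -0.5
print(neuron(w, b, x, relu))     # 0.0
print(neuron(w, b, x, tanh))     # about -0.46
print(neuron(w, b, x, sigmoid))  # about 0.38
```

Notice that the linear step is identical in all three cases; only the nonlinear transformation changes, which is exactly the design choice the lecture says requires experimentation.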
Calculus, gradient descent, just like we already saw in linear regression and logistic regression. And hey, would you look at that? You could actually implement logistic regression as a neural network: a neural network with an input layer, the features, and a single-neuron output layer. That single neuron performs the usual linear operation and then transforms the output by sending it through the logistic transfer function, which in the context of neural networks is more commonly called the sigmoid function. So, linear followed by nonlinear. The output in this case is, we hope, the probability of a data point belonging to a class, just as we originally described. This probability is then compared against a threshold to arrive at a class by using a unit step function. That makes some sense of how the weights for the last neuron can be found: they're adjusted to minimize loss, contrasting its output with the correct label. But what about all those other neurons, the ones in the hidden layers? What makes one set of weights better than another? Actually, it's the same principle as for the output layer, except now we don't care whether a hidden neuron predicts the label correctly. What we care about is that it does something useful for the next layer, and that's a team effort. With a neural network we don't want to optimize each neuron separately; it isn't even clear how we could. We know what we want the entire network to do. This is supervised learning, so we want it to map the inputs to some appropriate output, but it's rarely clear what each neuron of the layers between the input and the output layer should be optimized to do. This mystery is partly why we call them hidden layers. But we do know we need each layer to output values that will make the work of the next layer easier, and that last layer, the output layer, should predict the correct label. So our usual optimization method for neural networks starts there and works backwards.
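Logistic regression as a single-neuron network is small enough to write out in full. The weights below are made up; in practice they'd be learned by gradient descent:

```python
import math

def sigmoid(z):
    # The logistic transfer function, called sigmoid in the neural
    # network context.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # One neuron: the usual linear step followed by the sigmoid.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict(weights, bias, features, threshold=0.5):
    # Unit step: compare the probability against a threshold
    # to arrive at a class.
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical learned weights for a two-feature problem.
w, b = [1.5, -2.0], 0.25
print(predict_proba(w, b, [2.0, 1.0]))  # about 0.78
print(predict(w, b, [2.0, 1.0]))        # 1
```

This is the whole network: input layer (the two features) feeding one output neuron, linear followed by nonlinear, then a threshold.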
We call this algorithm backpropagation. It actually uses gradient descent, which you already know about, to iteratively adjust the weights of each layer of the neural network, starting with the output. More detail than that will have to come from an in-depth course. Neural networks are a powerful method for machine learning. They're very flexible: they can learn complex nonlinear functions and can also make effective use of very large feature spaces. This has resulted in never-before-seen success in applications like language translation, voice transcription, recognition of pedestrians and road signs for autonomous cars, digitizing books, check processing, digital personal assistants, and more. But neural networks can't be used for everything. Remember how I mentioned that the hidden layers are a bit mysterious? One of the open problems around neural networks is explainability, which basically means understanding why a neural network gives the output it does. We currently can't explain the choices of a neural network to a human, which means we're reasonably wary about using neural networks to make decisions that could have detrimental effects on people's lives, like in healthcare, physical systems engineering, policy, or policing. Back in the first course we talked about examining your need for explainability in your final model. And we also pointed out that bias in data can be amplified by machine learning algorithms, making neural networks' explainability, or lack thereof, important. And as you've learned, nothing comes for free: the huge hypothesis space explored by neural networks creates variance. We can represent complicated functions, but we might overfit. This is why having lots of data is incredibly important when using a deep learning algorithm. But if you have the data, go for it. Deep learning has led to many advances that even a decade ago were thought to be absolutely impossible, like beating the world's best players in the complex game of Go.
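The place where backpropagation starts, adjusting the output neuron's weights by gradient descent to reduce the loss, can at least be sketched for the single-sigmoid-neuron case. This is a deliberately simplified illustration, not full backpropagation (which would push the error signal backwards through the hidden layers), and the example data is made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, x, y, lr=0.1):
    # Forward pass: linear step, then sigmoid, as in logistic regression.
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # For the logistic (cross-entropy) loss, the gradient signal at the
    # output neuron works out to (p - y): prediction minus correct label.
    err = p - y
    # Gradient descent: nudge each weight against the gradient.
    new_w = [w - lr * err * xi for w, xi in zip(weights, x)]
    new_b = bias - lr * err
    return new_w, new_b

# Repeated gradient steps on one made-up labeled example (label = 1).
w, b = [0.0, 0.0], 0.0
for _ in range(100):
    w, b = train_step(w, b, [1.0, 2.0], 1)
p = sigmoid(w[0] * 1.0 + w[1] * 2.0 + b)
print(p)  # the predicted probability has moved toward the label, 1
```

Backpropagation repeats this kind of update for every layer, using calculus (the chain rule) to work out what each hidden neuron's share of the error is.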
Although if you do go for it, you'll probably want to spend some time exploring the ins and outs of network structures and activation functions, and how to implement the darn things. Luckily, there's a course for that. All right, now you understand neural networks, at least at a high level. You've seen what an artificial neuron actually does and, somewhat, how they're learned. And now you truly understand why you need so much data for a neural network to actually generalize. We'll leave further exploration to other courses and move on to another significant classification method, support vector machines.