Now we'll look at another way to estimate w and b for a linear model, called Ridge Regression.

Ridge regression uses the same least-squares criterion,

but with one difference.

During the training phase,

it adds a penalty for feature weights,

the w_i values, that are too large, as shown in the equation here.

You'll see that large weights mean, mathematically, that

the sum of their squared values is large.
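
The equation on the slide is not part of this transcript. A standard way to write the ridge objective, assuming N training instances, p features, and a regularisation strength alpha (the parameter discussed later in this lecture), is:

```latex
\mathrm{RSS}_{\mathrm{ridge}}(w, b)
  = \sum_{i=1}^{N} \bigl( y_i - (w \cdot x_i + b) \bigr)^2
  + \alpha \sum_{j=1}^{p} w_j^2
```

The first sum is the ordinary least-squares criterion, and the second sum is the penalty on large weights.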

Once ridge regression has estimated the w and b parameters for the linear model,

the prediction of Y values for new instances is exactly the same as in least squares.

You just plug in your input feature values,

the x_i values, and compute the sum of

the weighted feature values plus b, with the usual linear formula.

So why would something like ridge regression be useful?

This addition of a penalty term to

a learning algorithm's objective function is called Regularisation.

Regularisation is an extremely important concept in machine learning.

It's a way to prevent overfitting, and thus,

improve the likely generalization performance of a model,

by restricting the model's possible parameter settings.

Usually the effect of this restriction from regularisation,

is to reduce the complexity of the final estimated model.

So how does this work with linear regression?

The addition of the sum of squared parameter values that's shown in the box,

to the least-squares objective means that models with

larger feature weights (w) add more to the objective function's overall value.

Because our goal is to minimize the overall objective function,

the regularisation term acts as a penalty on

models with lots of large feature weight values.

In other words, all things being equal,

if ridge regression finds

two possible linear models that predict the training data values equally well,

it will prefer the linear model that has

a smaller overall sum of squared feature weights.

The practical effect of using ridge regression,

is to find feature weights w_i that fit the data well in a least-squares sense,

and that set lots of the feature weights to values that are very small.

We don't see this effect with a single variable linear regression example,

but for regression problems with dozens or hundreds of features,

the accuracy improvement from using

regularized linear regression like ridge regression could be significant.

The amount of regularisation to apply is controlled by the alpha parameter.

Larger alpha means more regularization and

simpler linear models with weights closer to zero.

The default setting for alpha is 1.0.

Notice that setting alpha to zero corresponds to the special case of

ordinary least-squares linear regression that we saw earlier,

that minimizes the total squared error.

In scikit-learn, you use ridge regression by importing

the Ridge class from sklearn.linear_model,

and then use that estimator object just as you would for least-squares.

The one difference is that you can specify

the amount of the ridge regression regularisation penalty,

which is called the L2 penalty,

using the alpha parameter.
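
As a minimal sketch of that usage, here is Ridge next to ordinary least-squares. The crime data set from the notebook is not reproduced in this transcript, so this assumes a synthetic data set from make_regression as a stand-in, and the alpha value of 20.0 is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the lecture's crime data set)
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least-squares for comparison
linreg = LinearRegression().fit(X_train, y_train)

# Ridge regression: same estimator interface, plus the alpha (L2 penalty) parameter
ridge = Ridge(alpha=20.0).fit(X_train, y_train)

ols_sq_weights = np.sum(linreg.coef_ ** 2)
ridge_sq_weights = np.sum(ridge.coef_ ** 2)

print("least-squares sum of squared weights:", ols_sq_weights)
print("ridge sum of squared weights:", ridge_sq_weights)
print("ridge R-squared (training):", ridge.score(X_train, y_train))
print("ridge R-squared (test):", ridge.score(X_test, y_test))
```

The ridge model's sum of squared weights comes out smaller than the least-squares one, which is the penalty doing its job.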

Here, we're applying ridge regression to the crime data set.

Now, you'll notice here that the results are not that impressive.

The R-squared score on the test set is pretty

comparable to what we got for least-squares regression.

However there is something we can do in

applying ridge regression that will improve the results dramatically.

So now is the time for a brief digression about the need for

feature preprocessing and normalization.

Let's stop and think for a moment intuitively,

what ridge regression is doing.

It's regularizing the linear regression by imposing

that sum of squares penalty on the size of the W coefficients.

So the effect of increasing alpha is to

shrink the w coefficients toward zero and towards each other.

But if the input variables, the features,

have very different scales,

then when this shrinkage of the coefficients happens,

input variables with different scales will have

different contributions to this L2 penalty,

because the L2 penalty is a sum of squares of all the coefficients.

So transforming the input features,

so they're all on the same scale,

means the ridge penalty is in some sense applied more fairly to all features

without unduly weighting some more than others,

just because of the difference in scales.

So more generally, you'll see as we proceed through the course that feature

normalization is important to perform for a number of different learning algorithms,

beyond just regularized regression.

This includes K-nearest neighbors,

support vector machines, neural networks and others.

The type of feature preprocessing and

normalization that's needed can also depend on the data.

For now, we're going to apply a widely used form of

feature normalization called MinMax Scaling,

that will transform all the input variables,

so they're all on the same scale between zero and one.

To do this, we compute

the minimum and maximum values for each feature on the training data,

and then apply the minmax transformation for each feature as shown here.

Here's an example of how it works with two features.

Suppose we have one feature "height" whose values fall in

a fairly narrow range between 1.5 and 2.5 units.

But a second feature,

"width" has a much wider range between five and 10 units.

After applying minmax scaling,

values for both features are transformed so that they are on the same scale,

with the minimum value getting mapped to zero,

and the maximum value being transformed to one.

And everything else getting transformed to a value between those two extremes.
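
The minmax transformation for a feature value x is (x - min) / (max - min), where min and max are taken per feature. Here is a small sketch of that arithmetic in NumPy, assuming some made-up height and width values in the ranges just described:

```python
import numpy as np

# Hypothetical values: column 0 is "height" (1.5 to 2.5 units),
# column 1 is "width" (5 to 10 units)
X_train = np.array([[1.5, 5.0],
                    [2.0, 10.0],
                    [2.5, 7.5],
                    [1.75, 6.0]])

# Per-feature minimum and maximum, computed down each column
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# The minmax transformation: (x - min) / (max - min)
X_train_scaled = (X_train - col_min) / (col_max - col_min)
print(X_train_scaled)
```

Each column now has its smallest value mapped to zero and its largest mapped to one.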

To apply minmax scaling,

in scikit-learn, you import the MinMaxScaler object from sklearn.preprocessing.

To prepare the scaler object for use, you create it,

and then call the fit method using the training data X_train.

This will compute the min and max feature values

for each feature in this training dataset.

Then to apply the scaler,

you call its transform method,

and pass in the data you want to rescale.

The output will be the scaled version of the input data.

In this case, we want to scale the training data and save it

in a new variable called X_train_scaled.

And the test data,

saving that into a new variable called X_test_scaled.

Then, we just use

these scaled versions of the feature data instead of the original feature data.

Note that it could be more efficient to perform

fitting and transforming in a single step on the training set,

by using the scaler's fit_transform method as shown here.
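
The slide code is not in this transcript, so here is a minimal sketch of those steps, assuming tiny made-up arrays for X_train and X_test:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up data for illustration (not from the lecture)
X_train = np.array([[1.5, 5.0],
                    [2.0, 10.0],
                    [2.5, 7.5]])
X_test = np.array([[2.0, 6.0],
                   [3.0, 4.0]])

# Create the scaler and fit it on the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

# Apply the same fitted scaler to both the training and the test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fitting and transforming the training set in a single step
X_train_scaled_alt = MinMaxScaler().fit_transform(X_train)

print(X_train_scaled)
print(X_test_scaled)
```

Notice that the second test point lands outside the zero-to-one range. That is expected, because the min and max come from the training data, and test values can fall beyond them.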

There's one last but very important point here,

about how to apply minmax scaling or any kind of

feature normalization in a learning scenario with training and test sets.

You may have noticed two things here.

First, that we're applying the same scaler object to both the training and the test data.

And second, that we're training

the scaler object on the training data and not on the test data.

These are both critical aspects to feature normalization.

If you don't apply the same scaling to training and test sets,

you'll end up with more or less random data skew,

which will invalidate your results.

If you prepare the scaler or other normalization method by

showing it the test data instead of the training data,

this leads to a phenomenon called Data Leakage,

where the training phase has information that is leaked from the test set,

such as the distribution of extreme values for each feature in the test data,

which the learner should never have access to during training.

This in turn can cause the learning method to give

unrealistically good estimates on the same test set.

We'll look more at the phenomenon of data leakage later in the course.

One downside to performing feature normalization is that

the resulting model and the transformed features may be harder to interpret.

Again, in the end,

the type of feature normalization that's best to apply,

can depend on the data set,

learning task and learning algorithm to be used.

We'll continue to touch on this issue throughout the course.

Okay, let's return to

ridge regression after we've added the code for minmax scaling of the input features.

We can see the significant effect of

minmax scaling on the performance of ridge regression.

After the input features have been properly scaled,

ridge regression achieves significantly better model fit

with an R-squared value on the test set of about 0.6.

Much better than without scaling,

and much better now than ordinary least-squares.

In fact if you apply the same minmax scaling with ordinary least-squares regression,

you should find that it doesn't change the outcome at all.
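
Here is a small sketch you could run to check that claim. It assumes a synthetic data set whose features are deliberately put on very different scales. The data and the alpha value are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption), with features stretched to different scales
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X = X * np.array([1.0, 10.0, 100.0, 0.1, 0.01])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training data only, then apply it to both sets
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ordinary least-squares: test R-squared with and without scaling
ols_raw = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
ols_scaled = LinearRegression().fit(X_train_scaled, y_train).score(X_test_scaled, y_test)

# Ridge regression: the penalty depends on coefficient size, so scaling matters
ridge_raw = Ridge(alpha=20.0).fit(X_train, y_train).score(X_test, y_test)
ridge_scaled = Ridge(alpha=20.0).fit(X_train_scaled, y_train).score(X_test_scaled, y_test)

print("least-squares R-squared, raw vs scaled:", ols_raw, ols_scaled)
print("ridge R-squared, raw vs scaled:", ridge_raw, ridge_scaled)
```

The least-squares scores match up to floating-point error, because rescaling a feature is just absorbed into its weight and the intercept. Ridge has no such invariance, since the penalty is applied to the weights themselves.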

In general, regularisation works especially

well when you have relatively small amounts of

training data compared to the number of features in your model.

Regularisation becomes less important as the amount of training data you have increases.

We can see the effect of varying the amount of

regularisation on the scaled training and

test data using different settings for alpha in this example.

The best R-squared value on the test set is achieved with an alpha setting of around 20.

Significantly larger or smaller values of alpha,

both lead to significantly worse model fit.

This is another illustration of the general relationship between

model complexity and test set performance that we saw earlier in this lecture,

where there's often an intermediate best value of a model

complexity parameter that leads to neither under- nor overfitting.
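
A sketch of that kind of alpha sweep is below. It assumes synthetic data rather than the crime data set, so the best alpha it shows will not match the value of around 20 from the lecture. The list of alpha values is also just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the crime data set)
X, y = make_regression(n_samples=120, n_features=60, n_informative=10,
                       noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

alphas = [0.01, 1, 10, 20, 50, 100, 1000]
weight_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_train_scaled, y_train)
    sq_weights = np.sum(ridge.coef_ ** 2)
    weight_norms.append(sq_weights)
    print("alpha = {}: sum of squared weights = {:.2f}, "
          "R-squared training = {:.2f}, test = {:.2f}".format(
              alpha, sq_weights,
              ridge.score(X_train_scaled, y_train),
              ridge.score(X_test_scaled, y_test)))
```

The sum of squared weights falls as alpha grows, which is the simpler-model effect. Where the test score peaks depends on the data.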

Another kind of regularized regression that you

could use instead of ridge regression is called Lasso Regression.

Like ridge regression, lasso regression adds

a regularisation penalty term to the ordinary least-squares objective,

that causes the model's w coefficients to shrink towards zero.

Lasso regression uses a slightly different regularisation term called an L1 penalty,

instead of ridge regression's L2 penalty as shown here.

The L1 penalty looks kind of similar to the L2 penalty,

in that it computes a sum over the coefficients, but it's

a sum of the absolute values of the w coefficients instead of a sum of squares.
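
The slide equation is not in this transcript. Using the same assumed notation as for ridge regression (N training instances, p features, regularisation strength alpha), the lasso objective with its L1 penalty can be written as:

```latex
\mathrm{RSS}_{\mathrm{lasso}}(w, b)
  = \sum_{i=1}^{N} \bigl( y_i - (w \cdot x_i + b) \bigr)^2
  + \alpha \sum_{j=1}^{p} \lvert w_j \rvert
```

The only change from the ridge objective is that each weight enters the penalty through its absolute value rather than its square. Note that scikit-learn's Lasso scales the squared-error term by a constant factor that depends on the number of samples, so its alpha is not numerically identical to the alpha in this form.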

And the results are noticeably different.

With lasso regression, a subset of the coefficients are forced to be precisely zero.

Which is a kind of automatic feature selection,

since with the weight of zero the features are

essentially ignored completely in the model.

This sparse solution where only a subset of

the most important features are left with non-zero weights,

also makes the model easier to interpret

in cases where there are more than a few input variables.

Like ridge regression, the amount of regularisation for

the lasso regression is controlled by the parameter alpha,

which by default is 1.0.

Also like ridge regression,

the purpose of using lasso regression is to estimate the w and b model coefficients.

Once that's done, the prediction model formula is the same as for ordinary least-squares,

you just use the linear model.

In general, lasso regression is most helpful if you think there are

only a few variables that have a medium or large effect on the output variable.

Otherwise if there are lots of variables that contribute small or medium effects,

ridge regression is typically the better choice.

Let's take a look at lasso regression in scikit-learn using the notebook,

using our communities and crime regression data set.

To use lasso regression,

you import the Lasso class from sklearn.linear_model,

and then just use it as you would use an estimator like ridge regression.

With some data sets you may occasionally get a convergence warning,

in which case you can set the max_iter parameter to a larger value,

typically at least 20,000, or possibly more.

Increasing the max_iter parameter will increase the computation time accordingly.
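
Here is a minimal sketch of Lasso with a larger max_iter, followed by a count of the features that keep non-zero weights. It assumes synthetic data in place of the crime data set, so the count will differ from the one in the lecture.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption): 50 features, only 5 informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A larger max_iter gives the solver more iterations to converge
lasso = Lasso(alpha=2.0, max_iter=20000).fit(X_train_scaled, y_train)

# Features whose weights were not driven to exactly zero
n_nonzero = int(np.sum(lasso.coef_ != 0))
print("features with non-zero weight:", n_nonzero, "of", X.shape[1])
print("R-squared (training):", lasso.score(X_train_scaled, y_train))
print("R-squared (test):", lasso.score(X_test_scaled, y_test))

# List the surviving features in descending order of weight magnitude
order = np.argsort(-np.abs(lasso.coef_))
for idx in order[:n_nonzero]:
    print("feature {}: weight {:.3f}".format(idx, lasso.coef_[idx]))
```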

In this example, we're applying lasso to

a minmax scaled version of the crime data set as we did for ridge regression.

You can see that for Alpha set to 2.0,

only 20 features with non-zero weights remain because with lasso regularisation,

most of the features are set to have weights of exactly zero.

I've listed the features with non-zero weights in

descending order of their magnitude in the output.

Although we need to be careful in interpreting

any results for data on a complex problem like crime,

the lasso regression results do help us see some of

the strongest relationships between

the input variables and outcomes for this particular data set.

For example, looking at the top five features with

non-zero weight that are found by lasso regression,

we can see that location factors like percentage of people in dense housing,

which indicates urban areas, and socioeconomic variables like

the fraction of vacant houses in an area are positively correlated with crime.

And other variables like the percentage of families with

two parents are negatively correlated.

Finally, we can see the effect of tuning

the regularisation parameter alpha for lasso regression.

Like we saw with ridge regression,

there's an optimal range for alpha that gives

the best test set performance, neither underfitting nor overfitting.

Of course this best alpha value will be different for different data sets,

and depends on various other factors such as the feature

preprocessing methods being used.

Let's suppose for a moment that we had a set of

two-dimensional data points with features X0 and X1.

Then we could transform each data point by adding additional features that

were the three unique multiplicative combinations of x0 and x1:

x0 squared, x0 times x1,

and x1 squared.

So we've transformed our original two-dimensional points into a set of

five-dimensional points that rely only on the information in the two-dimensional points.
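
To make that concrete, here is a small sketch using scikit-learn's PolynomialFeatures class on two made-up two-dimensional points. Setting include_bias=False is a choice made here so that the output has exactly the five features described, without an extra constant column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up two-dimensional points (x0, x1)
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# Degree-2 expansion without the constant (bias) column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Columns are: x0, x1, x0 squared, x0 times x1, x1 squared
print(X_poly)
```

The first point (2, 3) becomes (2, 3, 4, 6, 9), and the second point (1, 4) becomes (1, 4, 1, 4, 16).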

Now we can write a new regression problem that tries to predict

the same output variable y-hat but using these five features instead of two.

The critical insight here is that this is still a linear regression problem.

The features are just numbers within a weighted sum.

So we can use the same least-squares techniques to estimate

the five model coefficients for these five features that we

used in the simpler two-dimensional case.

Now, why would we want to do this kind of transformation?

Well, this is called polynomial feature transformation, which

we can use to transform a problem into a higher dimensional regression space.

And in effect, adding these extra polynomial features gives us a

much richer set of complex functions that we can use to fit to the data.

So you can think of this intuitively as allowing

polynomials to be fit to the training data instead of simply a straight line,

but using the same least-squares criterion that minimizes mean squared error.

We'll see later that this approach of adding

new features like polynomial features is also very effective with classification.

And we'll look at this kind of transformation

again in kernelized support vector machines.

When we add these new polynomial features,

we're essentially adding to the model's ability to capture interactions

between the different variables by adding them as features to the linear model.

For example, it may be that housing prices vary as

a quadratic function of both the lot size that a house sits on,

and the amount of taxes paid on the property as a theoretical example.

A simple linear model could not capture this nonlinear relationship,

but by adding nonlinear features like polynomials to the linear regression model,

we can capture this nonlinearity.

More generally, we can use other types of

nonlinear feature transformations beyond just polynomials.

This is beyond the scope of this course but technically these are

called nonlinear basis functions for regression,

and are widely used.

Of course, one side effect of adding lots of new features

especially when we're taking every possible combination of K variables,

is that these more complex models have the potential for overfitting.

So in practice, polynomial regression is

often done with a regularized learning method like ridge regression.

Here's an example of polynomial regression using scikit-learn.

There's already a handy class called PolynomialFeatures in the

sklearn.preprocessing module that will generate these polynomial features for us.

This example shows three regressions on a more complex regression dataset,

that happens to have some quadratic interactions between variables.

The first regression here,

just uses least-squares regression without the polynomial feature transformation.

The second regression creates the PolynomialFeatures object with degree set to two,

and then calls the fit_transform method of

the PolynomialFeatures object on the original XF1 features,

to produce the new polynomial-transformed features XF1 poly.

The code then calls ordinary least-squares linear regression.

You can see indications of overfitting on this expanded feature representation,

as the model's r-squared score on the training set is

close to one but much lower on the test set.

So the third regression shows the effect of adding

regularisation via ridge regression on this expanded feature set.

Now, the training and test r-squared scores are basically the same,

with the test set score of

the regularized polynomial regression

performing the best of all three regression methods.
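
The notebook code is not part of this transcript, so here is a sketch of the three regressions under some assumptions: the data comes from scikit-learn's make_friedman1 generator with 100 samples and 7 features, the variable names X_F1 and X_F1_poly are guesses at the spoken names, and Ridge uses its default alpha. The scores it prints may not match the ones on the slide.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Assumed data set: a synthetic regression problem with nonlinear structure
X_F1, y_F1 = make_friedman1(n_samples=100, n_features=7, random_state=0)

# 1. Least-squares regression on the original features
X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)
print("linear: training {:.3f}, test {:.3f}".format(
    linreg.score(X_train, y_train), linreg.score(X_test, y_test)))

# 2. Least-squares regression on degree-2 polynomial features.
# PolynomialFeatures learns nothing from the data values, so expanding
# before the split does not leak test information.
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X_F1_poly, y_F1, random_state=0)
linreg_poly = LinearRegression().fit(X_train_p, y_train_p)
print("poly + linear: training {:.3f}, test {:.3f}".format(
    linreg_poly.score(X_train_p, y_train_p), linreg_poly.score(X_test_p, y_test_p)))

# 3. Ridge regression on the same polynomial features
ridge_poly = Ridge().fit(X_train_p, y_train_p)
print("poly + ridge: training {:.3f}, test {:.3f}".format(
    ridge_poly.score(X_train_p, y_train_p), ridge_poly.score(X_test_p, y_test_p)))
```

With 7 original features, the degree-2 expansion produces 36 columns including the constant term, which is a lot of parameters for 75 training points. That is why the unregularized polynomial fit is prone to overfitting.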