Overfitting & Cross-Validation

👨‍🏫 [Sync]

%matplotlib inline

import pandas
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import exp, log1p, sqrt

Assuming you have the files downloaded on the path ../data/:

orig = pandas.read_csv("../data/paris_abb.csv.zip")
res = pandas.read_parquet("../data/lm_results.parquet")

If you’re online, you can do:

orig = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")
res = pandas.read_parquet("https://github.com/darribas/data_science_studio/raw/master/content/data/lm_results.parquet")

Either way, join the model results onto the original table:

db = orig.join(res)

Overfitting

So we have models, but are they any good?

Model evaluation in ML is all about predictive performance (remember \(\hat{\beta}\) vs \(\hat{y}\)? This is all about \(\hat{y}\)).

f, axs = plt.subplots(2, 2, figsize=(12, 12))
axs = axs.flatten()
for i, m in enumerate(res.columns.drop("Truth")):
    ax = axs[i]
    # Observed vs predicted values for each model, with a fitted line
    sns.regplot(x="Truth",
                y=m,
                data=res,
                ci=None,
                ax=ax
               )
f.suptitle("Observed vs Predicted")
plt.show()
[Figure: Observed vs Predicted scatter plots, one panel per model]

There are several measures; we will use two:

  • \(R^2\)
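
For reference, this is the coefficient of determination that r2_score computes, for observations \(y_i\), predictions \(\hat{y}_i\) and observed mean \(\bar{y}\):

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]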

from sklearn.metrics import r2_score

Let’s compare apparent performance:

# R² for each model; the log models are evaluated on the log1p scale
r2s = pandas.Series({"LM-Raw": r2_score(db["Truth"],
                                        db["LM-Raw"]
                                       ),
                     "LM-Log": r2_score(log1p(db["Truth"]),
                                        db["LM-Log"]
                                       ),
                     "RF-Raw": r2_score(db["Truth"],
                                        db["RF-Raw"]
                                       ),
                     "RF-Log": r2_score(log1p(db["Truth"]),
                                        db["RF-Log"]
                                       ),
                    })
r2s
LM-Raw    0.102694
LM-Log    0.315159
RF-Raw    0.225217
RF-Log    0.443318
dtype: float64
  • (R)MSE
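
For reference, the (root) mean squared error over \(n\) observations, expressed in the units of the outcome once the square root is taken:

\[
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]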

from sklearn.metrics import mean_squared_error as mse

And a similar comparison (where we can convert all predictions to price units):

# RMSE for each model; exp() converts the log-space predictions back to price units
mses = pandas.Series({"LM-Raw": mse(db["Truth"],
                                    db["LM-Raw"]
                                   ),
                      "LM-Log": mse(db["Truth"],
                                    exp(db["LM-Log"])
                                   ),
                      "RF-Raw": mse(db["Truth"],
                                    db["RF-Raw"]
                                   ),
                      "RF-Log": mse(db["Truth"],
                                    exp(db["RF-Log"])
                                   ),
                     }).apply(sqrt)
mses
LM-Raw     149.522153
LM-Log    7570.085725
RF-Raw     138.939346
RF-Log     140.904608
dtype: float64

Now this is great news for the random forest! 🎉

… or is it?

Let’s do a thought experiment. Imagine that AirBnb, for some reason, did not initially move into the Passy neighbourhood.

Let’s quickly retrain our models for that world. For convenience, let’s focus on the raw models, not those based on logs:

# One-hot encode room type ("Entire home/apt" is dropped below as the reference category)
room_type_ds = pandas.get_dummies(db["room_type"])
X = pandas.concat([db[["bathrooms", "bedrooms"]],
                   room_type_ds.drop("Entire home/apt",
                                     axis=1)
                  ], axis=1
                 )
# Passy IDs
pi = db.query("neighbourhood_cleansed == 'Passy'").index
# Non Passy IDs
npi = db.index.difference(pi)

# Non Passy data (same as in previous notebook)
X_np = X.reindex(npi)
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Linear model
lm_raw_np_model = LinearRegression().fit(X_np, 
                                         db.loc[npi, "Price"]
                                        )
lm_raw_np = lm_raw_np_model.predict(X_np)
# RF
rf_raw_np_model = RandomForestRegressor(n_estimators=100,
                                        max_features=None
                                       )\
                                      .fit(X_np, 
                                           db.loc[npi, "Price"]
                                          )
rf_raw_np = rf_raw_np_model.predict(X_np)

The apparent performance of our models is similar, if anything a bit better:

mses_np = pandas.Series({"LM-Raw": mse(db.loc[npi, "Truth"], 
                                       lm_raw_np
                                      ),
                         "RF-Raw": mse(db.loc[npi, "Truth"],
                                       rf_raw_np
                                      ),
                        }).apply(sqrt)
mses_np
LM-Raw    144.326619
RF-Raw    134.277649
dtype: float64
mses.reindex(mses_np.index)
LM-Raw    149.522153
RF-Raw    138.939346
dtype: float64

Now imagine that Passy comes onto the AirBnb market and we need to provide price estimates for its properties. We can use our models to make predictions:

# Linear predictions
lm_raw_passy_pred = lm_raw_np_model.predict(X.reindex(pi))
lm_raw_passy_pred = pandas.Series(lm_raw_passy_pred,
                                  index=pi
                                 )
# RF predictions
rf_raw_passy_pred = rf_raw_np_model.predict(X.reindex(pi))
rf_raw_passy_pred = pandas.Series(rf_raw_passy_pred,
                                  index=pi
                                 )

How are our models doing now?

mses_p = pandas.Series({"LM-Raw": mse(db.loc[pi, "Truth"], 
                                      lm_raw_passy_pred
                                     ),
                        "RF-Raw": mse(db.loc[pi, "Truth"],
                                      rf_raw_passy_pred
                                     ),
                       }).apply(sqrt)
mses_p
LM-Raw    235.291495
RF-Raw    218.293239
dtype: float64
mses_np
LM-Raw    144.326619
RF-Raw    134.277649
dtype: float64

Not that great on data the models have not seen before.

Linear regression is a particular case: given the features chosen, there is not much left to tweak. The random forest, however, does allow us to tweak things a bit further. How to do so is a matter of (computational) brute force and a bit of savviness in designing the data split.

Cross-Validation to the rescue

  • What it does: gives you a better sense of the actual performance of your model

  • What it doesn’t do: estimate a better model per se

Train/test split

from sklearn.model_selection import train_test_split
# Hold out 80% of observations for testing; train on the remaining 20%
x_train, x_test, y_train, y_test = train_test_split(X,
                                                    db["Price"],
                                                    test_size=0.8
                                                   )

For the linear model:

lm_estimator = LinearRegression()
# Fit on X/Y train
lm_estimator.fit(x_train, y_train)
# Predict on X test
lm_y_pred = lm_estimator.predict(x_test)
# Evaluate on Y test
sqrt(mse(y_test, lm_y_pred))
146.63691841889124

For the random forest:

rf_estimator = RandomForestRegressor(n_estimators=100,
                                     max_features=None
                                    )
# Train on X/Y train
rf_estimator.fit(x_train, y_train)
# Predict on X test
rf_y_pred = rf_estimator.predict(x_test)
# Evaluate on Y test
sqrt(mse(y_test, rf_y_pred))
140.59053185334008

Worse than the apparent (in-sample) performance, but better than on an entirely new set of data with a (potentially) different structure.

Now, the split that train_test_split performs is random. What if it happens to be a very particular one? Can we trust the performance scores we recover?
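
As a minimal sketch of the concern (not part of the original workflow; the seed values are arbitrary), we can repeat the split a few times and watch the linear model’s out-of-sample RMSE move:

# Repeat the 20/80 split with different random seeds and compare RMSEs
for seed in [0, 1, 2, 3, 4]:
    xtr, xte, ytr, yte = train_test_split(X,
                                          db["Price"],
                                          test_size=0.8,
                                          random_state=seed
                                         )
    pred = LinearRegression().fit(xtr, ytr).predict(xte)
    print(seed, sqrt(mse(yte, pred)))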

\(k\)-fold CV

from sklearn.model_selection import cross_val_score

For the linear model:

lm_kcv_mses = cross_val_score(LinearRegression(),
                              X,
                              db["Price"],
                              cv=5,
                              scoring="neg_mean_squared_error"
                             )
# sklearn reports negative MSE so that optimisation is always maximisation
sqrt(-lm_kcv_mses).mean()
156.0444939069434

And for the random forest:

rf_estimator = RandomForestRegressor(n_estimators=100,
                                     max_features=None
                                    )
rf_kcv_mses = cross_val_score(rf_estimator,
                              X,
                              db["Price"],
                              cv=5,
                              scoring="neg_mean_squared_error"
                             )
# sklearn reports negative MSE so that optimisation is always maximisation
sqrt(-rf_kcv_mses).mean()
141.65779397663923

These would be more reliable measures of model performance. The difference between CV’ed estimates and original ones can be seen as an indication of overfitting.

mses
LM-Raw     149.522153
LM-Log    7570.085725
RF-Raw     138.939346
RF-Log     140.904608
dtype: float64
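
To make the comparison explicit, here is a small sketch (using only objects created above) lining up the apparent and cross-validated RMSEs for the raw models:

# Apparent (full-sample) vs cross-validated RMSE for the raw models
comparison = pandas.DataFrame({
    "Apparent": mses.reindex(["LM-Raw", "RF-Raw"]),
    "CV": pandas.Series({"LM-Raw": sqrt(-lm_kcv_mses).mean(),
                         "RF-Raw": sqrt(-rf_kcv_mses).mean()
                        })
})
comparison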

In our case, since we’re using very few features, both models probably already embed enough “regularisation” and the changes are not very dramatic. In other contexts, the change can be drastic (e.g. Arribas-Bel, Patino & Duque, 2017).

Parameter optimisation in scikit-learn

Random forests are really optimised over two key parameters, n_estimators and max_features, although there are more available to tweak:

RandomForestRegressor().get_params()
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Here we’ll exhaustively test every combination of several candidate values for each parameter (6 × 5 = 30 combinations):

param_grid = {
    "n_estimators": [5, 25, 50, 75, 100, 150],
    "max_features": [1, 2, 3, 4, 5]
}

We will use GridSearchCV to automatically work over every combination, create cross-validated scores (MSE), and pick the preferred one.

from sklearn.model_selection import GridSearchCV

This can get fairly computationally intensive, and it is also “embarrassingly parallel”, so it’s a good candidate to parallelise.

grid = GridSearchCV(RandomForestRegressor(),
                    param_grid,
                    scoring="neg_mean_squared_error",
                    cv=5,
                    n_jobs=-1
                   )

Once defined, we call fit to actually execute the computations on a given set of data:

%%time
grid.fit(X, db["Price"])
CPU times: user 631 ms, sys: 165 ms, total: 796 ms
Wall time: 10.5 s
GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_features': [1, 2, 3, 4, 5],
                         'n_estimators': [5, 25, 50, 75, 100, 150]},
             scoring='neg_mean_squared_error')

And we can explore the output from the same grid object:

grid_res = pandas.DataFrame(grid.cv_results_)
grid_res.head()
mean_fit_time std_fit_time mean_score_time std_score_time param_max_features param_n_estimators params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.059308 0.002531 0.007023 0.000338 1 5 {'max_features': 1, 'n_estimators': 5} -14164.970336 -24324.348853 -10955.871710 -28297.397306 -27114.016072 -20971.320855 7060.913913 29
1 0.260195 0.003964 0.019304 0.000458 1 25 {'max_features': 1, 'n_estimators': 25} -14129.505213 -24516.056914 -10649.406165 -27980.354563 -27066.294890 -20868.323549 7101.263798 20
2 0.519160 0.003526 0.035556 0.000450 1 50 {'max_features': 1, 'n_estimators': 50} -14048.048029 -24529.450849 -10680.936105 -27865.459466 -27057.329039 -20836.244698 7084.651423 13
3 0.818000 0.017073 0.054218 0.002030 1 75 {'max_features': 1, 'n_estimators': 75} -14005.614738 -24534.834379 -10295.045845 -28000.288810 -27077.156117 -20782.587978 7234.805494 7
4 1.071697 0.034987 0.065108 0.001701 1 100 {'max_features': 1, 'n_estimators': 100} -14062.448420 -24476.717132 -11118.962404 -27920.867261 -27071.548980 -20930.108839 6965.483109 28

And the winner is…

grid_res[grid_res["mean_test_score"] \
         == \
         grid_res["mean_test_score"].max()
        ]["params"]
6    {'max_features': 2, 'n_estimators': 5}
Name: params, dtype: object
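
Equivalently, GridSearchCV exposes the winning combination, its score, and the estimator refitted on the full data with those parameters:

# Best parameter combination found by the grid search
grid.best_params_
# Its mean cross-validated score (negative MSE)
grid.best_score_
# The estimator refitted on the full data with the best parameters
grid.best_estimator_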

Or, we can visualise the surface generated in the grid:

grid_res[["param_n_estimators", "param_max_features"]]\
        .astype(float)\
        .plot.scatter("param_n_estimators",
                      "param_max_features",
                      c=grid_res["mean_test_score"],
                      cmap="viridis"
                     );
[Figure: mean cross-validated score across the n_estimators × max_features grid]

It is also possible to see that, as expected, the more trees, the longer it takes to compute:

grid_res[["param_n_estimators", "mean_fit_time"]]\
        .astype(float)\
        .plot.scatter("param_n_estimators",
                      "mean_fit_time"
                     );
[Figure: mean fit time as a function of n_estimators]

EXERCISE What’s the model performance (R/MSE) of the random forest setup we’ve used with the optimised set of parameters from the experiment above?
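
One possible way to approach it, as a sketch only (reusing objects created above; your exact numbers will vary with the grid results):

# Cross-validated RMSE of a random forest with the winning parameters
best_rf = RandomForestRegressor(**grid.best_params_)
best_scores = cross_val_score(best_rf,
                              X,
                              db["Price"],
                              cv=5,
                              scoring="neg_mean_squared_error"
                             )
sqrt(-best_scores).mean()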


More at: