Model selection

  • Guidelines
    1. Size of the dataset (ANNs require more data)
    2. Interpretability (how easily the model can be explained to stakeholders)
    3. Flexibility (how few assumptions the model makes about the data)

train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # (1)
  1. stratify ensures that the class distribution of y is preserved across the split: if y is 20% yes and 80% no, then both the train and test sets will also be roughly 20% yes and 80% no (a quick check is sketched below).
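
A quick sanity check, as a sketch assuming y is a pandas Series of binary labels like in the example above: the class proportions should look roughly the same overall and in both splits.

print(y.value_counts(normalize=True))        # overall class proportions
print(y_train.value_counts(normalize=True))  # proportions in the training set
print(y_test.value_counts(normalize=True))   # proportions in the test set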

cross_val_score and KFold

kf = KFold(n_splits=6, shuffle=True, random_state=5) # (1)
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error") # (2)
  1. KFold creates folds of the data based on the number of splits.
  2. Returns an array with the chosen metric computed on each of the folds; the scoring argument controls which metric is used (see the RMSE sketch below).
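
Since scikit-learn's "neg_mean_squared_error" is the negated MSE (so that higher is always better), the per-fold RMSE can be recovered as below; this is a small follow-up sketch assuming cv_scores comes from the call above.

import numpy as np

rmse_per_fold = np.sqrt(-cv_scores)  # flip the sign to get MSE, then take the square root
print(rmse_per_fold.mean(), rmse_per_fold.std())  # average RMSE and its spread across folds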

GridSearchCV and RandomizedSearchCV

  • We use cross-validation when tuning hyperparameters so that the selected values do not depend on one particular choice of training subset.
# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)} # (2)

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf) # (1)

# Fit to the training data to run the search
lasso_cv.fit(X_train, y_train)

print("Tuned lasso parameters: {}".format(lasso_cv.best_params_)) # (3)
print("Tuned lasso score: {}".format(lasso_cv.best_score_)) # (4)
  1. We have to pass in the model, the parameter grid and folds for cross-validation.
  2. Multiple parameters can be passed in the grid; the search evaluates every possible combination of their values.
  3. Gives us the best selected parameters from the grid search.
  4. Gives us the score corresponding to the best selected parameters.

GridSearchCV is convenient but computationally expensive, since it evaluates every combination in the grid. RandomizedSearchCV does the same job but samples a fixed number of parameter settings at random instead of checking each and every one of them, as sketched below.
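
A minimal sketch of the randomized variant, assuming the same lasso, kf, X_train and y_train as in the grid search above; n_iter caps how many parameter settings are sampled from the grid.

from sklearn.model_selection import RandomizedSearchCV

param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

# Sample 5 alpha values at random instead of trying all 20
lasso_random_cv = RandomizedSearchCV(lasso, param_grid, cv=kf, n_iter=5, random_state=42)
lasso_random_cv.fit(X_train, y_train)

print("Tuned lasso parameters: {}".format(lasso_random_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_random_cv.best_score_))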


Evaluating Classification Models

Training set performance

models = {
    "logreg": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "dectree": DecisionTreeClassifier(),
}

results = []

for model in models.values():
    kf = KFold(n_splits=6, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)

plt.boxplot(results, labels=models.keys())  
plt.show()

Test set performance

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_score))