Model selection
- Guidelines:
    - Size of the dataset (ANNs require more data)
    - Interpretability (some models are easier to explain to stakeholders)
    - Flexibility (some models make fewer assumptions about the data)
train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # (1)
`stratify` here ensures that the distribution of `y` is preserved in the train/test split (if `y` is made up of 20% yes and 80% no, then both the train and test datasets will also comprise 20% yes and 80% no).
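A minimal sketch of what this means in practice (the synthetic `X` and `y` below are made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20% "yes", 80% "no"
X = np.arange(100).reshape(-1, 1)
y = np.array(["yes"] * 20 + ["no"] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the 20/80 ratio of the full dataset
print(np.mean(y_train == "yes"))  # 0.2
print(np.mean(y_test == "yes"))   # 0.2
```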
cross_val_score and KFold
kf = KFold(n_splits=6, shuffle=True, random_state=5) # (1)
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error") # (2)
- `KFold` creates folds of the data based on the number of splits.
- Gives a list of the same metric calculated for each of the `kf` folds. We have a choice of changing the `scoring` metric.
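Because scikit-learn always maximizes scores, error metrics come back negated. A self-contained sketch of turning the fold scores into RMSE values, using placeholder data from `make_regression` and `LinearRegression` as a stand-in model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder regression data, just for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

model = LinearRegression()
kf = KFold(n_splits=6, shuffle=True, random_state=5)

# scoring="neg_mean_squared_error" returns negative MSE, so flip the sign
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-cv_scores)

print(rmse_per_fold.mean(), rmse_per_fold.std())
```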
GridSearchCV and RandomizedSearchCV
- We cross-validate when selecting hyperparameters so that the selection does not depend on any particular choice of training subset.
# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)} # (2)

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf) # (1)

# Fit to the training data so that best_params_ / best_score_ are available
lasso_cv.fit(X_train, y_train)

print("Tuned lasso parameters: {}".format(lasso_cv.best_params_)) # (3)
print("Tuned lasso score: {}".format(lasso_cv.best_score_)) # (4)
- We have to pass in the model, the parameter grid, and the folds for cross-validation.
- Multiple parameters can be passed in the grid; the search checks every possible combination (see the sketch after this list).
- Gives us the best selected parameters from the grid search.
- Gives us the score corresponding to the best selected parameters.
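For the multi-parameter case, a hedged sketch using `ElasticNet` (the model, grid values, and variable names here are illustrative, and `X_train`, `y_train`, and `kf` are assumed from earlier):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# 20 alphas x 5 l1_ratios = 100 candidate combinations,
# each evaluated on every fold of kf
enet_grid = {
    "alpha": np.linspace(0.00001, 1, 20),
    "l1_ratio": np.linspace(0.1, 1, 5),
}

enet_cv = GridSearchCV(ElasticNet(), enet_grid, cv=kf)
enet_cv.fit(X_train, y_train)

print(enet_cv.best_params_, enet_cv.best_score_)
```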
GridSearchCV is nice, but computationally expensive: it fits every combination in the grid. RandomizedSearchCV does the same job, but samples parameter settings at random instead of checking each and every one of them.
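A minimal sketch of the randomized variant, reusing the `lasso`, `param_grid`, and `kf` from above; `n_iter` caps how many random candidates get evaluated:

```python
from sklearn.model_selection import RandomizedSearchCV

# Tries only n_iter randomly chosen settings instead of the full grid
lasso_random_cv = RandomizedSearchCV(lasso, param_grid, cv=kf, n_iter=10, random_state=42)
lasso_random_cv.fit(X_train, y_train)

print("Tuned lasso parameters: {}".format(lasso_random_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_random_cv.best_score_))
```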
Evaluating Classification Models
Training set performance
models = {
    "logreg": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "dectree": DecisionTreeClassifier(),
}
results = []
for model in models.values():
    kf = KFold(n_splits=6, random_state=42, shuffle=True)
    # Cross-validated accuracy on the scaled training data for each model
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)

# Compare the spread of CV accuracies across models
plt.boxplot(results, labels=models.keys())
plt.show()
Test set performance
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_score))