Model selection

  • Guidelines
    1. Size of the dataset (ANNs require more data)
    2. Interpretability (how easily the model can be explained to stakeholders)
    3. Flexibility (how few assumptions the model makes about the data)

train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # (1)
  1. stratify ensures that the class distribution of y is preserved across the split: if y is 20% yes and 80% no, then both the train and test sets will also be roughly 20% yes and 80% no (a quick check is sketched below).
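
A quick sanity check, as a sketch assuming y is a pandas Series of binary labels like in the example above: the class proportions should look roughly the same overall and in both splits.

print(y.value_counts(normalize=True))        # overall class proportions
print(y_train.value_counts(normalize=True))  # proportions in the training set
print(y_test.value_counts(normalize=True))   # proportions in the test set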

cross_val_score and KFold

kf = KFold(n_splits=6, shuffle=True, random_state=5) # (1)
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error") # (2)
  1. KFold creates folds of the data based on the number of splits.
  2. Returns an array with the chosen metric computed on each of the folds; the scoring argument controls which metric is used (see the RMSE sketch below).
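
Since scikit-learn's "neg_mean_squared_error" is the negated MSE (so that higher is always better), the per-fold RMSE can be recovered as below; this is a small follow-up sketch assuming cv_scores comes from the call above.

import numpy as np

rmse_per_fold = np.sqrt(-cv_scores)  # flip the sign to get MSE, then take the square root
print(rmse_per_fold.mean(), rmse_per_fold.std())  # average RMSE and its spread across folds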

GridSearchCV and RandomizedSearchCV

  • We use cross-validation when tuning hyperparameters so that the selected values do not depend on one particular choice of training subset.
# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)} # (2)

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf) # (1)

# Fit to the training data to run the search
lasso_cv.fit(X_train, y_train)

print("Tuned lasso parameters: {}".format(lasso_cv.best_params_)) # (3)
print("Tuned lasso score: {}".format(lasso_cv.best_score_)) # (4)
  1. We have to pass in the model, the parameter grid and folds for cross-validation.
  2. Multiple parameters can be passed in the grid; the search evaluates every possible combination of their values.
  3. Gives us the best selected parameters from the grid search.
  4. Gives us the score corresponding to the best selected parameters.

GridSearchCV is convenient but computationally expensive, since it evaluates every combination in the grid. RandomizedSearchCV does the same job but samples a fixed number of parameter settings at random instead of checking each and every one of them, as sketched below.
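
A minimal sketch of the randomized variant, assuming the same lasso, kf, X_train and y_train as in the grid search above; n_iter caps how many parameter settings are sampled from the grid.

from sklearn.model_selection import RandomizedSearchCV

param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

# Sample 5 alpha values at random instead of trying all 20
lasso_random_cv = RandomizedSearchCV(lasso, param_grid, cv=kf, n_iter=5, random_state=42)
lasso_random_cv.fit(X_train, y_train)

print("Tuned lasso parameters: {}".format(lasso_random_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_random_cv.best_score_))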


Evaluating Classification Models

Training set performance

models = {
    "logreg": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "dectree": DecisionTreeClassifier(),
}

results = []

for model in models.values():
    kf = KFold(n_splits=6, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)

plt.boxplot(results, labels=models.keys())  
plt.show()

Test set performance

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_score))