Impute

Split the data first! Imputation values (means, medians, most frequent categories) must be computed from the training set only; otherwise we leak information about the test set to the model.

  • Numerical and categorical variables need different imputation methods, so handle them separately:
    1. Split the features into X_cat and X_num.
    2. Split X_cat into categorical training and test sets.
    3. Split X_num into numerical training and test sets with the same random_state (and test size), so both splits contain the same rows.
    4. Fit a separate imputer on X_train_cat and on X_train_num.
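
The four steps above can be sketched as follows. This is a minimal illustration, not the only way to do it: the toy DataFrame, its column names, the target y, and the test_size and random_state values are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not from the notes): two numerical columns and one
# categorical column, each with some missing values.
X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 38.0, np.nan, 29.0, 60.0],
    "income": [40.0, 52.0, np.nan, 88.0, 61.0, 45.0, np.nan, 97.0],
    "city": ["Oslo", "Bergen", None, "Oslo", "Oslo", None, "Bergen", "Oslo"],
})
y = pd.Series([0, 1, 0, 1, 1, 0, 0, 1])

# Step 1: separate categorical and numerical features.
X_cat = X.select_dtypes(exclude="number")
X_num = X.select_dtypes(include="number")

# Steps 2 and 3: split both with the same random_state and test_size,
# so the same rows end up in the training and test sets of each.
X_train_cat, X_test_cat, y_train, y_test = train_test_split(
    X_cat, y, test_size=0.25, random_state=42
)
X_train_num, X_test_num = train_test_split(
    X_num, test_size=0.25, random_state=42
)
```

Comparing X_train_cat.index with X_train_num.index is a quick way to confirm that the two splits line up row for row.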

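An imputer is a transformer: fit learns the fill values from the training set, and transform applies them to any dataset, including the test set.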
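
A small sketch of what "transformer" means in practice, using scikit-learn's SimpleImputer with the mean strategy. The arrays are invented for illustration, and the mean is just one possible choice for numerical data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with missing values.
X_train = np.array([[1.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

imp = SimpleImputer(strategy="mean")

# fit: learn the fill value from the observed training values only.
imp.fit(X_train)

# transform: fill the gaps in the test set with the value learned in training.
X_test_filled = imp.transform(X_test)

print(imp.statistics_)  # fill value learned per column
print(X_test_filled)
```

The missing test value is filled with the training mean, not with anything computed from the test set, which is what keeps the split leak-free.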

SimpleImputer

from sklearn.impute import SimpleImputer
imp_cat = SimpleImputer(strategy="most_frequent")
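
As a hedged usage sketch, a categorical imputer with the "most_frequent" strategy (note the underscore in the strategy name) can be fitted on the training set and then reused on the test set. The small DataFrames below are hypothetical stand-ins for X_train_cat and X_test_cat.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical training and test sets with missing entries.
X_train_cat = pd.DataFrame(
    {"city": ["Oslo", "Bergen", np.nan, "Oslo"]}, dtype=object
)
X_test_cat = pd.DataFrame({"city": [np.nan, "Bergen"]}, dtype=object)

imp_cat = SimpleImputer(strategy="most_frequent")

# Learn the most frequent category from the training set and fill it in.
X_train_cat_filled = imp_cat.fit_transform(X_train_cat)

# Reuse the category learned in training; do not refit on the test set.
X_test_cat_filled = imp_cat.transform(X_test_cat)

print(X_test_cat_filled)
```

Both results come back as NumPy arrays rather than DataFrames by default, so column names need to be reattached if they are required downstream.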