Preprocessing

Scaling

Many models rely on some measure of distance between observations. However, variables often sit on very different scales (some extremely large, some extremely small, some taking negative values, some strictly positive). This can cause problems: a variable with very large values can dominate the distance and disproportionately influence the model.
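To make the problem concrete, here is a small sketch using plain Python. The two data points, the feature names (income and age), and the feature ranges are all made up for illustration. It measures what fraction of the squared Euclidean distance comes from the income feature, before and after rescaling each feature to \([0,1]\).

```python
# Two hypothetical people described by (income in dollars, age in years).
a = (50_000.0, 25.0)
b = (51_000.0, 60.0)

# Assumed feature ranges (min, max), used only to rescale each feature to [0, 1].
ranges = [(20_000.0, 120_000.0), (18.0, 80.0)]


def squared_diffs(p, q):
    """Per-feature squared differences; their sum is the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(p, q)]


def income_share(p, q):
    """Fraction of the squared distance contributed by the first feature (income)."""
    diffs = squared_diffs(p, q)
    return diffs[0] / sum(diffs)


def min_max(p):
    """Rescale each feature of a point to [0, 1] using the assumed ranges."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(p, ranges))


raw_share = income_share(a, b)
scaled_share = income_share(min_max(a), min_max(b))

print(f"income share of squared distance, raw:    {raw_share:.4f}")
print(f"income share of squared distance, scaled: {scaled_share:.4f}")
```

On the raw data, almost all of the distance comes from a 2% difference in income, while the 35-year age gap barely registers. After rescaling, the age gap accounts for nearly all of the distance.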

Thus, we rescale the data so that all variables share a common scale, for example by mapping every variable into \([0,1]\) while preserving the relative spacing of its values.

Three ways:

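  1. Subtract the mean and divide by the standard deviation (standardization). The result has mean 0 and standard deviation 1; this does not by itself make the data normally distributed.
  2. Subtract the minimum and divide by the range (\(\text{range}=\text{max}-\text{min}\)), which maps the data to \([0,1]\).
  3. Rescale to \([-1,+1]\), for example by applying the min-max scaling from item 2 and then mapping \(x \mapsto 2x-1\).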
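The three options can be sketched in plain Python as follows. The function names `standardize` and `min_max_scale` and the sample data are illustrative, not part of any library, and the sketch assumes the values are not all identical (otherwise the divisor would be zero).

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Map values linearly so the minimum lands on `low` and the maximum on `high`."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


data = [2.0, 4.0, 6.0, 8.0, 10.0]

z = standardize(data)                 # mean 0, standard deviation 1
unit = min_max_scale(data)            # range [0, 1]
sym = min_max_scale(data, -1.0, 1.0)  # range [-1, +1]

print(unit)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(sym)   # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

All three are linear maps of the original values, so they change location and spread but not the shape of the distribution.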

Models affected by scaling (so scale the features before fitting and evaluating them):

  1. KNN
  2. Linear Regression (especially with Ridge or Lasso regularization, whose penalties depend on the size of the coefficients)
  3. Logistic Regression
  4. Artificial Neural Network

StandardScaler

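from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test data; do not refit here.
X_test_scaled = scaler.transform(X_test)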
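To show why fitting and transforming are separate steps, here is a minimal pure-Python sketch of the idea. `TinyStandardScaler` is a hypothetical, simplified stand-in written for this note, not scikit-learn's implementation, and the data is made up. The statistics are learned once from the training rows and then reused unchanged on the test rows, so the test set never influences the scaling.

```python
import statistics


class TinyStandardScaler:
    """Hypothetical, simplified illustration of the fit/transform split."""

    def fit(self, rows):
        # Compute one mean and one population standard deviation per column.
        columns = list(zip(*rows))
        self.means = [statistics.mean(col) for col in columns]
        self.stds = [statistics.pstdev(col) for col in columns]
        return self

    def transform(self, rows):
        # Apply the stored statistics; nothing is re-estimated here.
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


X_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
X_test = [[2.0, 400.0]]

scaler = TinyStandardScaler().fit(X_train)    # statistics come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)      # same means and stds as for training

print(X_test_scaled)
```

The second test feature lies outside the training range, so its scaled value is larger than any scaled training value. That is expected: the test data is expressed in the units learned from the training data. This sketch does not handle constant columns, where the standard deviation is zero.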