
Project 01: Employee Attrition Prediction

1.5 - Data Scaling and Splitting

πŸ”–
Data Scaling - Involves adjusting the range of features to ensure that they are on a similar scale.

This scaling is particularly important for algorithms sensitive to feature magnitude.

Once the data is encoded, it is time to scale the features to a common unit, typically a mean of 0 and a standard deviation of 1. Scaling is used when features have different units or ranges.

Advantages

  • Prevent domination: Features with large ranges (e.g.: income in dollars) can dominate features with smaller ranges (e.g.: age).
  • Algorithm compatibility: Many ML models (e.g.: KNN, SVM, neural networks) rely on distance metrics or gradients, which require scaled features.
  • Improve convergence speed: Helps gradient-descent-based algorithms converge faster during training.
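To see the "domination" problem concretely, here is a small sketch with made-up numbers (not from the attrition dataset): an income feature in dollars next to an age feature in years.

```python
# Illustrative sketch with made-up numbers (not from the attrition dataset):
# two features with very different ranges -- income in dollars and age in years.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [50_000.0, 25.0],   # income, age
    [51_000.0, 60.0],   # similar income, very different age
    [90_000.0, 26.0],   # very different income, similar age
])

# Raw Euclidean distances from row 0: the income column dominates,
# so the large age gap to row 1 barely registers.
d_raw = np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2])
print(d_raw)  # roughly (1000.6, 40000.0)

# After standardization, both features contribute comparably.
Xs = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(Xs[0] - Xs[1]), np.linalg.norm(Xs[0] - Xs[2])
print(d_scaled)  # both distances are now on the same order of magnitude
```

Before scaling, a distance-based model like KNN would treat the $1,000 income gap as far more significant than a 35-year age gap; after scaling, the two differences carry comparable weight.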

Types of scaling

  1. Normalization: Scales data to a [0, 1] range (e.g.: Min-Max scaling).
  2. Standardization: Scales data to zero mean and unit variance (z-score).
  3. Log transformation: Handles skewed distributions by applying the log function.
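A quick sketch of the three techniques side by side, on a small invented feature with heavy skew:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0]])  # a heavily skewed toy feature

# 1. Normalization (Min-Max): squeezes values into [0, 1]
x_minmax = MinMaxScaler().fit_transform(x)

# 2. Standardization (z-score): zero mean, unit variance
x_std = StandardScaler().fit_transform(x)

# 3. Log transformation: compresses the skew (log1p also handles zeros safely)
x_log = np.log1p(x)

print(x_minmax.ravel())           # [0.0, ~0.009, ~0.099, 1.0]
print(x_std.mean(), x_std.std())  # ~0.0 and 1.0
print(x_log.ravel())              # values now much more evenly spread
```

Note that Min-Max and z-score scaling preserve the skew (the scaled values are still bunched at the low end), while the log transformation actually reshapes the distribution.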

Standardization technique

step-1: Split X and y into a training set and a test set.

# first split the data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape # (59598, 29)
X_test.shape # (14900, 29)

step-2: Apply scaling to the train and test datasets.

# scaling the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
✍️
X_train - used for training the model.
X_test - used to check whether the model predicts the correct output.
test_size - here 20% of the data is held out for testing; splits between 70/30 and 80/20 train/test are common.
random_state - setting a random_state ensures your code produces the same split every time it is run, which is crucial for reproducibility.

Why use fit_transform on X_train ?

fit: calculates the mean and standard deviation of the given dataset. These statistics are then stored for use in future transformations.

transform: uses the previously calculated (fitted) statistics to scale a dataset without re-calculating them.

The model should only "learn" from the training data. When fit_transform is called on X_train, the StandardScaler calculates the mean and standard deviation of the training data and scales it using those statistics.
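A tiny sketch (with made-up numbers) of what fit stores and how transform reuses it:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])   # toy training feature
X_test = np.array([[10.0]])                 # toy test feature

scaler = StandardScaler()
scaler.fit(X_train)      # learns statistics from the training data only
print(scaler.mean_)      # [2.0] -- the train mean; the test value plays no part
print(scaler.scale_)     # the train standard deviation (~0.816)

# transform reuses those stored statistics: (10 - 2) / 0.816... ~= 9.8
print(scaler.transform(X_test))
```

Note that the test value 10.0 has no influence on `mean_` or `scale_`; it is simply mapped through the statistics learned from the training set.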

Why only transform on X_test ?

To prevent data leakage, the statistics (mean and standard deviation) calculated from the training data are applied to the test data.

You should not call fit_transform on test data because:

  • It would compute new statistics for the test data, introducing information that wouldn't naturally be available during training.
  • This violates the principle of keeping the test data unseen during the training phase.

So by calling transform, you ensure that:

  • The test data is scaled using the same parameters (mean and standard deviation) derived from the training data.
  • The scaling process is consistent between the training and testing datasets.

Key Points

  • fit_transform on X_train: Calculates scaling parameters and scales the training data.
  • transform on X_test: Applies the same scaling parameters derived from X_train to ensure the model sees the test data scaled consistently with the training data.

This approach maintains the integrity of your model's evaluation and ensures no information from the test set influences the training phase.