1.5 - Data Scaling and Splitting
Once the data is encoded, the next step is to scale the features so that each one has a mean of 0 and a standard deviation of 1. Scaling is used when features have different units or ranges, and it is particularly important for algorithms that are sensitive to feature magnitude.
Advantages
- Prevents domination: features with large ranges (e.g.: income in dollars) might otherwise dominate features with small ranges (e.g.: age).
- Algorithm compatibility: many ML models (e.g.: KNN, SVM, neural networks) rely on distance metrics or gradients, which require scaled features.
- Improves convergence speed: helps gradient-descent-based algorithms converge faster during training.
Types of scaling
- Normalization: Scale data to a [0, 1] range (e.g.: Min-Max scaling)
- Standardization: Scale data to have zero mean and unit variance (z-score)
- Log Transformation: Handle skewed distributions by applying the log function.
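The three scaling types above can be sketched directly in NumPy on a small hypothetical feature, before reaching for scikit-learn's built-in transformers:

```python
import numpy as np

# toy feature spanning several orders of magnitude (illustrative values)
x = np.array([1.0, 10.0, 100.0, 1000.0])

# Normalization (Min-Max): maps values into the [0, 1] range
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance
z = (x - x.mean()) / x.std()

# Log transformation: compresses a right-skewed distribution;
# log1p computes log(1 + x), which also handles zeros safely
logged = np.log1p(x)

print(min_max.min(), min_max.max())          # 0.0 1.0
print(round(z.mean(), 6), round(z.std(), 6)) # 0.0 1.0
```

scikit-learn provides the same operations as `MinMaxScaler`, `StandardScaler`, and `FunctionTransformer(np.log1p)`.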
Standardization technique
step-1: Split the X and y into training set and test set.
# first split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape # (59598, 29)
X_test.shape # (14900, 29)
step-2: Apply scaling to train and test datasets.
# scaling the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_test - held-out data used to check whether the model predicts the correct output.
test_size - takes 20% of the data for testing; 80/20 and 70/30 train/test splits are both commonly used.
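A quick way to see what `random_state` buys you is to perform the same split twice on a small synthetic dataset (the arrays here are illustrative, not the dataset used above) and confirm the results are identical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# tiny toy dataset: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# same random_state -> the shuffle, and therefore the split, is identical
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True
```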
random_state - by setting a random_state, you ensure that your code produces the same split every time it is run, which is crucial for reproducibility.
Why use fit_transform on X_train?
- fit: calculates the mean and standard deviation of the given dataset. These statistics are then stored for use in future transformations.
- transform: uses the previously fitted statistics to scale a dataset without recalculating them.
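The two steps can be run separately to inspect the statistics that `fit` stores (the one-feature array here is a made-up example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy single-feature training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaler.fit(X_train)       # learns mean and std from the training data
print(scaler.mean_)       # [2.5]
print(scaler.scale_)      # stored std, ~[1.118]

X_scaled = scaler.transform(X_train)  # applies the stored statistics
print(round(X_scaled.mean(), 6))      # 0.0
```

`fit_transform` is simply these two calls combined into one.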
The model should only "learn" from the training data. When fit_transform is called on X_train, the StandardScaler calculates the mean and standard deviation of the training data and scales it using those statistics.
Why transform for X_test?
To prevent data leakage, the statistics (mean and standard deviation) calculated from the training data are applied to the test data.
You should not call fit_transform on test data because:
- It would compute new statistics for the test data, introducing information that wouldn't naturally be available during training.
- This violates the principle of keeping the test data unseen during the training phase.
So by calling transform, you ensure that:
- The test data is scaled using the same parameters (mean and standard deviation) derived from the training data.
- The scaling process is consistent between the training and test datasets.
Key Points
- fit_transform on X_train: calculates the scaling parameters and scales the training data.
- transform on X_test: applies the same scaling parameters derived from X_train, so the model sees the test data scaled consistently with the training data.
This approach maintains the integrity of your model's evaluation and ensures no information from the test set influences the training phase.
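One common way to make the fit-on-train / transform-on-test discipline automatic is scikit-learn's Pipeline, which chains the scaler and an estimator so that `fit` only ever sees training data. A minimal sketch, using synthetic data and a LogisticRegression classifier purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (illustrative, not the dataset from the text)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit() fits the scaler on X_train only; score()/predict() on X_test
# internally call transform() with the stored training statistics
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(acc)
```

Because the scaler lives inside the pipeline, there is no way to accidentally call fit_transform on the test set.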