Leakage-Free Feature Standardization
Leakage-Free Feature Standardization
Standardization (z-score normalization) centers each feature to zero mean and unit variance:
The leakage trap: If you compute and over both train and test data, test information contaminates your scaler — your validation metrics will be optimistic and the model will underperform in production.
Correct approach:
1. Fit on X_train only: compute per-column and using np.mean(..., axis=0) and np.std(..., axis=0)
2. Transform X_test with the same and
Edge case: If any column has (constant), set to avoid division by zero.
Your task:
Implement scale_with_train_stats(X_train, X_test) that returns standardized X_test.
Example Tests
2-feature train range [0,2]: test point [1,3] → [0.0, 2.0]
Input: {"X_test":[[1,3]],"X_train":[[0,0],[2,2]]}
Expected: [[0,2]]
Train with unit std: test point shifted and scaled correctly
Input: {"X_test":[[5,6]],"X_train":[[1,2],[3,4]]}
Expected: [[3,3]]
Single feature: two test points standardized by train stats
Input: {"X_test":[[2],[8]],"X_train":[[2],[4],[6]]}
Expected: [[-1.22474],[2.44949]]