Target Encoding for High-Cardinality Features
Target Encoding for High-Cardinality Features
Target encoding replaces each categorical value with the mean of the target variable for that category. It is far more memory-efficient than one-hot encoding when a feature has hundreds or thousands of unique values (e.g., zip codes, product IDs).
Algorithm:
1. For each unique category , collect all target values where categories[i] == c
2. Replace:
Example:
categories = ["A", "B", "A", "C"] targets = [2, 4, 6, 8]
[4.0, 4.0, 4.0, 8.0]> Leakage warning: In production, compute encoding means from training data only and apply them at inference time.
Your task:
Implement target_encode(categories, targets) that returns a float array of the same length.
Example Tests
Three categories: A and B have equal means, C differs
Input: {"targets":[2,4,6,8],"categories":["A","B","A","C"]}
Expected: [4,4,4,8]
X appears twice — mean computed over both occurrences
Input: {"targets":[10,20,5],"categories":["X","X","Y"]}
Expected: [15,15,5]
All unique categories: each maps to its own target value
Input: {"targets":[1,2,3],"categories":["A","B","C"]}
Expected: [1,2,3]