Remove Duplicate Rows
Duplicate records inflate training data and can cause a model to overfit to the repeated examples. np.unique(X, axis=0) removes duplicate rows but returns them in sorted order; a production pipeline must instead preserve insertion order so that downstream time-based logic is not disrupted.
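To make the ordering difference concrete, here is a small sketch. The array X is an illustrative input, not one of the exercise's test cases. It also shows one possible vectorized workaround, using return_index to recover first-occurrence positions; this is an alternative to the set-based strategy, not a replacement for it.

```python
import numpy as np

# Illustrative input whose first-appearance order differs from sorted order.
X = np.array([[3, 4], [1, 2], [3, 4], [0, 9]])

# np.unique with axis=0 returns the unique rows sorted lexicographically.
sorted_unique = np.unique(X, axis=0)
print(sorted_unique.tolist())   # [[0, 9], [1, 2], [3, 4]]

# return_index=True also yields the index of each row's first occurrence;
# sorting those indices restores first-appearance order.
_, first_idx = np.unique(X, axis=0, return_index=True)
ordered_unique = X[np.sort(first_idx)]
print(ordered_unique.tolist())  # [[3, 4], [1, 2], [0, 9]]
```

The workaround still sorts internally, so it mainly serves as a cross-check against a loop-based solution.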
Strategy:
1. Iterate through rows in order
2. Track seen rows using a set of tuples
3. Keep only the first occurrence of each unique row
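The three steps above could be sketched as follows. This is a minimal version under stated assumptions rather than a reference solution: it assumes X is 2D and that rows are compared by exact element equality. The helper names seen and keep are illustrative.

```python
import numpy as np

def remove_duplicate_rows(X):
    """Return the unique rows of a 2D array in first-appearance order."""
    X = np.asarray(X)
    seen = set()   # tuples of rows already encountered
    keep = []      # indices of first occurrences, in order

    for i, row in enumerate(X):
        key = tuple(row.tolist())  # arrays are unhashable; tuples are not
        if key not in seen:
            seen.add(key)
            keep.append(i)

    # Indexing with the kept positions preserves both order and dtype.
    return X[keep]

result = remove_duplicate_rows([[1, 2], [3, 4], [1, 2], [5, 6]])
print(result.tolist())  # [[1, 2], [3, 4], [5, 6]]
```

Collecting indices and slicing once at the end avoids rebuilding the array row by row and keeps the input's dtype.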
Example:
Input: [[1, 2], [3, 4], [1, 2], [5, 6]]
Output: [[1, 2], [3, 4], [5, 6]] ← insertion order preserved
The duplicate [1, 2] at index 2 is dropped; [3, 4] and [5, 6] keep their relative order.
Your task:
Implement remove_duplicate_rows(X) that returns a 2D NumPy array of unique rows in first-appearance order.
Example Tests
One duplicate at index 2 removed: output shape is (3, 2)
Input: {"X":[[1,2],[3,4],[1,2],[5,6]]}
Expected: [3,2]
First row of deduplicated output is correct
Input: {"X":[[1,2],[3,4],[1,2],[5,6]]}
Expected: [1,2]
Three occurrences of [1,1] collapsed to one: output shape is (3, 2)
Input: {"X":[[1,1],[2,2],[1,1],[1,1],[3,3]]}
Expected: [3,2]
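The example tests can be read as assertions: the first and third check the output shape, and the second checks the first row. The sketch below runs them against a compact variant that relies on dict.fromkeys keeping keys in insertion order. That variant is one possible implementation, not the required one, and it assumes a non-empty 2D input.

```python
import numpy as np

def remove_duplicate_rows(X):
    """Compact variant: dict keys keep first-insertion order."""
    X = np.asarray(X)
    unique_rows = dict.fromkeys(map(tuple, X.tolist()))
    return np.array(list(unique_rows), dtype=X.dtype)

# Tests 1 and 2: one duplicate at index 2 is removed.
out = remove_duplicate_rows([[1, 2], [3, 4], [1, 2], [5, 6]])
assert list(out.shape) == [3, 2]   # Expected: [3,2] is the output shape
assert out[0].tolist() == [1, 2]   # Expected: [1,2] is the first row

# Test 3: three occurrences of [1, 1] collapse to one.
out3 = remove_duplicate_rows([[1, 1], [2, 2], [1, 1], [1, 1], [3, 3]])
assert list(out3.shape) == [3, 2]
```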