Scikit-learn LabelEncoder: how to keep mappings between batches?

Question

I have 185 million samples, at approximately 3.8MB per sample. To prepare my dataset, I need to one-hot encode many features, after which I end up with over 15,000 features.

But I need to prepare the dataset in batches, since the features alone take over 100GB of memory when one-hot encoding just 3 million samples.

The question is how to keep encodings / mappings / labels consistent between batches. A batch will not necessarily contain every level of a category. For example, batch #1 can have Paris, Tokyo, Rome, while batch #2 can have Paris, London. But in the end I need Paris, Tokyo, Rome, and London to all map to one consistent encoding.

Assuming I can't determine the levels of the Cities column for all 185 million samples at once, since it won't fit in RAM, what should I do? If I apply the same LabelEncoder instance to different batches, will the mappings remain the same? After that I also need to one-hot encode, either with scikit-learn or with Keras's np_utils.to_categorical, in batches as well. So the same question applies: how can these three methods be used in batches, or applied directly to a file format stored on disk?

python-2.7 scikit-learn

user798719 May 15 '17 @ 2:45 am

1 answer

Max Power · Accepted Answer · 2017-05-15T03:22:13+0000

I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values during .fit(); otherwise it will throw an error when it encounters a new one during .transform().
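The question also asks whether one LabelEncoder instance keeps its mapping across batches. As a minimal sketch (the two toy batches are assumptions that mirror the cities in the question): the mapping stays fixed after .fit(), a value unseen during fitting makes .transform() raise a ValueError, and calling .fit() again on a later batch replaces the mapping instead of extending it.

```python
from sklearn.preprocessing import LabelEncoder

# Toy batches mirroring the question (illustrative data only)
batch_1 = ['Paris', 'Tokyo', 'Rome']
batch_2 = ['Paris', 'London']

encoder = LabelEncoder()
encoder.fit(batch_1)
# classes_ holds the sorted levels; a level's position is its integer code
mapping_after_fit = list(encoder.classes_)

# transform() never refits, so 'London' (unseen during fit) is rejected
try:
    encoder.transform(batch_2)
    unseen_raised = False
except ValueError:
    unseen_raised = True

# Refitting on batch_2 discards the old mapping: 'Paris' moves from code 0 to code 1
encoder.fit(batch_2)
mapping_after_refit = list(encoder.classes_)
```

So reusing the same instance only gives consistent codes if it was fitted once on every level that any batch can contain.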

import pandas as pd

# Create a toy dataset and split it into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]

# Convert the categorical feature column to a matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')

# Row-bind (append) the encoded batches back together.
# sort=True (pandas 0.23+) keeps the combined columns in alphabetical order.
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0, sort=True)

# Final wrap-up: replace NaNs with 0 and convert the flags to int
final_encoded = final_encoded.fillna(0).astype(int)

final_encoded

Output

   City_Chicago  City_London  City_Paris  City_Rome  City_Tokyo
0             0            0           1          0           0
1             0            0           0          0           1
2             0            0           0          1           0
3             0            1           0          0           0
4             1            0           0          0           0
5             0            0           1          0           0
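The concat step above still needs every encoded batch in memory at once. If that is not possible, one option is a two-pass sketch: first collect only the distinct levels (a small set even for 185 million rows), then encode each batch against that fixed list so every batch produces identical columns. This is a suggestion under assumptions, not part of the original answer; iter_batches() is a hypothetical stand-in for reading chunks from disk.

```python
import pandas as pd

def iter_batches():
    # Hypothetical stand-in for reading chunks from disk
    # (for example pd.read_csv with the chunksize argument)
    yield pd.Series(['Paris', 'Tokyo', 'Rome'])
    yield pd.Series(['London', 'Chicago', 'Paris'])

# Pass 1: collect the distinct levels only; no encoded data is kept
levels = set()
for batch in iter_batches():
    levels.update(batch.unique())
city_dtype = pd.CategoricalDtype(categories=sorted(levels))

def encode(batch):
    # Casting to a fixed categorical dtype makes get_dummies emit one
    # column per known level, even for levels missing from this batch
    return pd.get_dummies(batch.astype(city_dtype), prefix='City', dtype=int)

# Pass 2: encode batch by batch; each result could be written out
# immediately instead of being collected in a list as done here
encoded_batches = [encode(batch) for batch in iter_batches()]
```

Every frame in encoded_batches has the same five City_ columns in the same order, so the batches can be processed or stored independently.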
