LabelEncoder - inverse transform and using categorical data with a model
I am working on a prediction project (for fun). I pull men's and women's names from nltk, label each name "male" or "female", and take the last letter of each name; I then train various machine learning algorithms to predict gender from that last letter.
Since sklearn does not handle categorical (string) data directly, I used LabelEncoder to convert the last letter to numeric values:
Before conversion:
name last_letter gender
0 Aamir r male
1 Aaron n male
2 Abbey y male
3 Abbie e male
4 Abbot t male
name last_letter gender
0 Abagael l female
1 Abagail l female
2 Abbe e female
3 Abbey y female
4 Abbi i female
And after concatenating the two dataframes, dropping the name column, and shuffling:
last_letter gender
0 a male
1 e female
2 g male
3 h male
4 e male
Then I applied LabelEncoder:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])
df.head()
The dataframe becomes:
last_letter gender
0 1 male
1 5 female
2 7 male
3 8 male
4 5 male
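One caveat with the loop above: it reuses a single LabelEncoder, so after the loop the encoder only remembers the fit of the last column. A sketch (with a hypothetical toy dataframe standing in for the one in the question) that keeps one encoder per column, so each column can be inverted later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy dataframe mirroring the one in the question
df = pd.DataFrame({
    "last_letter": ["a", "e", "g", "h", "e"],
    "gender": ["male", "female", "male", "male", "male"],
})

# Keep one fitted encoder per column; reusing a single encoder
# would overwrite its fit on every iteration of the loop.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Each column can now be decoded back to its original values
original_letters = encoders["last_letter"].inverse_transform(df["last_letter"])
print(list(original_letters))  # ['a', 'e', 'g', 'h', 'e']
```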
After training the model (let's say a Random Forest here), I want to use it to predict the gender for a raw letter, e.g.
rf_model.predict('a')
This will not work, as the model only accepts numeric values. If instead I do:
rf_model.predict(1) (assuming the letter 'a' is encoded as the number 1)
the prediction returns
array([1])
So how do I do something like:
rf_model.predict('a')
and get "male" or "female" as the result, instead of feeding in a numeric value and getting a numeric value back?
Just use the LabelEncoder you created! Since it has already been fitted on the training data, you can apply it directly to new data with its transform method.
In [2]: from sklearn.preprocessing import LabelEncoder
In [3]: label_encoder = LabelEncoder()
In [4]: label_encoder.fit_transform(['a', 'b', 'c'])
Out[4]: array([0, 1, 2])
In [5]: label_encoder.transform(['a'])
Out[5]: array([0])
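The inverse direction works too: inverse_transform maps the encoded integers back to the original labels, which is handy whenever you need to decode a numeric value. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['a', 'b', 'c'])  # learns classes_ == ['a', 'b', 'c']

# transform goes string -> int, inverse_transform goes int -> string
encoded = label_encoder.transform(['a', 'c'])
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))  # ['a', 'c']
```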
To use it with RandomForestClassifier:
In [59]: from sklearn.ensemble import RandomForestClassifier
In [60]: X = ['a', 'b', 'c']
In [61]: y = ['male', 'female', 'female']
In [62]: X_encoded = label_encoder.fit_transform(X)
In [63]: rf_model = RandomForestClassifier()
In [64]: rf_model.fit(X_encoded[:, None], y)
Out[64]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
In [65]: x = ['a']
In [66]: x_encoded = label_encoder.transform(x)
In [67]: rf_model.predict(x_encoded[:, None])
Out[67]:
array(['male'],
dtype='<U6')
As you can see, you get the string outputs 'male' / 'female' directly from the classifier if you used the strings as labels when fitting it.
Refer to LabelEncoder.transform.
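Putting it together, one way to get the rf_model.predict('a')-style call the question asks for is to wrap the encode-then-predict steps in a small helper. This is a sketch: predict_gender is a hypothetical name, and the toy data below just stands in for the name-derived letters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy training data standing in for the name-derived letters
X = ['a', 'b', 'c', 'a', 'b']
y = ['male', 'female', 'female', 'male', 'female']

letter_encoder = LabelEncoder()
X_encoded = letter_encoder.fit_transform(X)

rf_model = RandomForestClassifier(n_estimators=10, random_state=0)
rf_model.fit(X_encoded[:, None], y)  # [:, None] reshapes 1-D to a single-column 2-D array

def predict_gender(letter):
    """Hypothetical helper: accept a raw letter, return the string label."""
    encoded = letter_encoder.transform([letter])
    return rf_model.predict(encoded[:, None])[0]

print(predict_gender('a'))  # prints 'male' or 'female'
```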