LabelEncoder - inverse transform and using categorical data with a model

I am working on a prediction project (for fun). I pull male and female names from nltk, label each name as "male" or "female", extract the last letter of each name, and then train different machine learning algorithms to predict gender from that last letter.

Since scikit-learn models do not handle categorical (string) data directly, I used LabelEncoder to convert the last letter to numeric values:

Before conversion:

     name     last_letter    gender
0    Aamir    r              male
1    Aaron    n              male
2    Abbey    y              male
3    Abbie    e              male
4    Abbot    t              male

     name       last_letter    gender
0    Abagael    l              female
1    Abagail    l              female
2    Abbe       e              female
3    Abbey      y              female
4    Abbi       i              female
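For reference, I built these dataframes roughly like this (a sketch; it assumes the nltk names corpus has already been downloaded):

import pandas as pd
from nltk.corpus import names  # requires nltk.download('names')

# One dataframe per gender; last_letter is the lowercased final character
male_df = pd.DataFrame({'name': names.words('male.txt')})
male_df['last_letter'] = male_df['name'].str[-1].str.lower()
male_df['gender'] = 'male'

female_df = pd.DataFrame({'name': names.words('female.txt')})
female_df['last_letter'] = female_df['name'].str[-1].str.lower()
female_df['gender'] = 'female'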


If we concatenate the two dataframes, drop the name column, and shuffle, we get:

     last_letter    gender
0    a              male
1    e              female
2    g              male
3    h              male
4    e              male
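
That step in code, roughly (a sketch, reusing male_df and female_df from above):

df = pd.concat([male_df, female_df], ignore_index=True)
df = df.drop(columns='name')
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows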


Then I used LabelEncoder:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Caution: reusing a single encoder means each fit_transform overwrites the
# previous fit, so afterwards it only remembers the last column's mapping
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])
df.head()


The dataframe becomes:

     last_letter    gender
0    1              1
1    5              0
2    7              1
3    8              1
4    5              1

(Note that the gender column is encoded too, since the loop runs over every column: female becomes 0 and male becomes 1.)
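
For reference, a variant that keeps a separately fitted, reusable encoder per column might look like this (sketch):

# One encoder per column, so each column's mapping can be reused later
encoders = {col: LabelEncoder() for col in df.columns}
for col, enc in encoders.items():
    df[col] = enc.fit_transform(df[col])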


After training the model (let's say a random forest here), suppose I want to use it to predict the gender for a new letter:

rf_model.predict('a')

This will not work, as the model only accepts numeric input. If I instead pass the encoded value:

rf_model.predict([[1]])  # assume the letter 'a' is encoded as 1

then the prediction also comes back as a number:

array([1])


So how do I do something like:

rf_model.predict('a') 


and get the result back as "male" or "female", instead of feeding in a numeric value and getting a numeric value out?





1 answer


Just use the same LabelEncoder you created! Since it has already been fitted on the training data, you can apply it directly to new data with its transform method.

In [2]: from sklearn.preprocessing import LabelEncoder

In [3]: label_encoder = LabelEncoder()

In [4]: label_encoder.fit_transform(['a', 'b', 'c'])
Out[4]: array([0, 1, 2])

In [5]: label_encoder.transform(['a'])
Out[5]: array([0])
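
If the target column was label-encoded as well (as in your loop), LabelEncoder.inverse_transform maps numeric predictions back to the original strings. A sketch with a second, hypothetical encoder for the gender column:

In [6]: gender_encoder = LabelEncoder()

In [7]: gender_encoder.fit_transform(['male', 'female', 'female'])
Out[7]: array([1, 0, 0])

In [8]: gender_encoder.inverse_transform([1])
Out[8]: array(['male'], dtype='<U6')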


To use it with RandomForestClassifier:


In [59]: from sklearn.ensemble import RandomForestClassifier

In [60]: X = ['a', 'b', 'c']

In [61]: y = ['male', 'female', 'female']

In [62]: X_encoded = label_encoder.fit_transform(X)

In [63]: rf_model = RandomForestClassifier()

In [64]: rf_model.fit(X_encoded[:, None], y)
Out[64]: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [65]: x = ['a']

In [66]: x_encoded = label_encoder.transform(x)

In [67]: rf_model.predict(x_encoded[:, None])
Out[67]: 
array(['male'], 
      dtype='<U6')


As you can see, you get the string outputs 'male' and 'female' directly from the classifier, because the string labels were used as the targets when fitting it.
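
Putting it together for your setup, a small helper (predict_letter is just a hypothetical name) hides the encoding step:

def predict_letter(letter, letter_encoder, model):
    """Encode a single letter and return the model's string prediction."""
    x = letter_encoder.transform([letter])  # letter -> numeric code
    return model.predict(x[:, None])[0]     # reshape to 2D for sklearn

predict_letter('a', label_encoder, rf_model)  # -> 'male'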

See the documentation for LabelEncoder.transform.









