LabelEncoder - inverse transform and using categorical data with a model
I am working on a prediction project (for fun). I pull men's and women's names from nltk, label each name "male" or "female", and take the last letter of each name; I then train various machine learning algorithms to predict gender from that last letter.
Since sklearn does not handle categorical (string) data directly, I used LabelEncoder to convert the last letter to numeric values:
Before conversion:
name last_letter gender
0 Aamir r male
1 Aaron n male
2 Abbey y male
3 Abbie e male
4 Abbot t male
name last_letter gender
0 Abagael l female
1 Abagail l female
2 Abbe e female
3 Abbey y female
4 Abbi i female
And after concatenating the two dataframes, dropping the name column, and shuffling:
last_letter gender
0 a male
1 e female
2 g male
3 h male
4 e male
Then I applied LabelEncoder:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])
df.head()
The dataframe becomes:
last_letter gender
0 1 male
1 5 female
2 7 male
3 8 male
4 5 male
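One caveat with the loop above: it reuses a single LabelEncoder, so after the loop the encoder only remembers the fit of the last column. A sketch (with a hypothetical toy dataframe standing in for the one in the question) that keeps one encoder per column, so each column can be inverted later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy dataframe mirroring the one in the question
df = pd.DataFrame({
    "last_letter": ["a", "e", "g", "h", "e"],
    "gender": ["male", "female", "male", "male", "male"],
})

# Keep one fitted encoder per column; reusing a single encoder
# would overwrite its fit on every iteration of the loop.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Each column can now be decoded back to its original values
original_letters = encoders["last_letter"].inverse_transform(df["last_letter"])
print(list(original_letters))  # ['a', 'e', 'g', 'h', 'e']
```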
After training the model (let's say a Random Forest here), I want to use it to predict the gender for a raw letter, e.g.
rf_model.predict('a')
This will not work, as the model only accepts numeric values. If instead I do:
rf_model.predict(1) (assuming the letter 'a' is encoded as the number 1)
the prediction returns
array([1])
So how do I do something like:
rf_model.predict('a')
and get "male" or "female" as the result, instead of feeding in a numeric value and getting a numeric value back?
Just use the LabelEncoder you created! Since it has already been fitted on the training data, you can apply it directly to new data with its transform method.
In [2]: from sklearn.preprocessing import LabelEncoder
In [3]: label_encoder = LabelEncoder()
In [4]: label_encoder.fit_transform(['a', 'b', 'c'])
Out[4]: array([0, 1, 2])
In [5]: label_encoder.transform(['a'])
Out[5]: array([0])
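The inverse direction works too: inverse_transform maps the encoded integers back to the original labels, which is handy whenever you need to decode a numeric value. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['a', 'b', 'c'])  # learns classes_ == ['a', 'b', 'c']

# transform goes string -> int, inverse_transform goes int -> string
encoded = label_encoder.transform(['a', 'c'])
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))  # ['a', 'c']
```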
To use it with RandomForestClassifier:
In [59]: from sklearn.ensemble import RandomForestClassifier
In [60]: X = ['a', 'b', 'c']
In [61]: y = ['male', 'female', 'female']
In [62]: X_encoded = label_encoder.fit_transform(X)
In [63]: rf_model = RandomForestClassifier()
In [64]: rf_model.fit(X_encoded[:, None], y)
Out[64]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
In [65]: x = ['a']
In [66]: x_encoded = label_encoder.transform(x)
In [67]: rf_model.predict(x_encoded[:, None])
Out[67]:
array(['male'],
dtype='<U6')
As you can see, you get the string outputs 'male' / 'female' directly from the classifier if you used the strings as labels when fitting it.
Refer to LabelEncoder.transform.
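Putting it together, one way to get the rf_model.predict('a')-style call the question asks for is to wrap the encode-then-predict steps in a small helper. This is a sketch: predict_gender is a hypothetical name, and the toy data below just stands in for the name-derived letters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy training data standing in for the name-derived letters
X = ['a', 'b', 'c', 'a', 'b']
y = ['male', 'female', 'female', 'male', 'female']

letter_encoder = LabelEncoder()
X_encoded = letter_encoder.fit_transform(X)

rf_model = RandomForestClassifier(n_estimators=10, random_state=0)
rf_model.fit(X_encoded[:, None], y)  # [:, None] reshapes 1-D to a single-column 2-D array

def predict_gender(letter):
    """Hypothetical helper: accept a raw letter, return the string label."""
    encoded = letter_encoder.transform([letter])
    return rf_model.predict(encoded[:, None])[0]

print(predict_gender('a'))  # prints 'male' or 'female'
```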