Encoding / factorizing lists in a pandas DataFrame

I am trying to encode lists of categories within a dataframe by factorizing them. I will then create a matrix from this series of lists (normalizing them to a given length, creating a multidimensional array, and one-hot encoding the elements of the matrix).

However, the factor codes are not consistent between rows. This can be seen here:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [ ['Other', 'Male', 'Female', 'Male', 'Other'], ['Female', 'Other', 'Male'] ]})
>>> df['B'] = df.A.apply(lambda x: pd.factorize(x)[0])
>>> df
                                    A                B
0  [Other, Male, Female, Male, Other]  [0, 1, 2, 1, 0]
1               [Female, Other, Male]        [0, 1, 2]

      

Does anyone know how to get an encoding for this series that is the same across rows?

2 answers


You can use LabelEncoder from sklearn.

Fit the encoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit([s for l in df.A for s in l])

      



Transform the column:

df.A.apply(le.transform)
#0    [2, 1, 0, 1, 2]
#1          [0, 2, 1]
#Name: A, dtype: object

le.classes_
#array(['Female', 'Male', 'Other'], 
#      dtype='<U6')
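
If you would rather not depend on sklearn, a similar result can probably be had with pandas alone. The following is only a sketch, not part of the answer above: it assumes a pandas version that has Series.explode (0.25 or later), and the names exploded, codes, uniques and encoded are made up for illustration. The idea is to flatten the lists, factorize everything once with sort=True so the codes follow sorted order, and then regroup by the original row index.

```python
import pandas as pd

df = pd.DataFrame({'A': [['Other', 'Male', 'Female', 'Male', 'Other'],
                         ['Female', 'Other', 'Male']]})

# Flatten the lists; each item keeps the index of the row it came from
exploded = df['A'].explode()

# Factorize all items at once so every row shares the same codes;
# sort=True assigns codes in sorted order of the unique values
codes, uniques = pd.factorize(exploded, sort=True)

# Regroup the codes into one list per original row
encoded = pd.Series(codes, index=exploded.index).groupby(level=0).agg(list)
df['B'] = encoded
```

Here uniques plays the same role as le.classes_: position n holds the category that was given code n.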

      


You can easily do this yourself using all the values in column A.

First, build a set of all unique items in column A. Then use a dictionary comprehension whose keys are those unique items and whose values are their positions in sorted order.

Finally, look up each element in that dictionary using a list comprehension.

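# Collect every unique item across all lists in column A
unique_items = set(item for sublist in df.A for item in sublist)
# Map each unique item to its position in sorted order
codes = {k: n for n, k in enumerate(sorted(unique_items))}

>>> df.assign(B=[[codes[key] for key in sublist] for sublist in df['A']])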
                                    A                B
0  [Other, Male, Female, Male, Other]  [2, 1, 0, 1, 2]
1               [Female, Other, Male]        [0, 2, 1]
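
The question also mentions normalizing the lists to a given length and one-hot encoding them. A rough sketch of that follow-up step, built on the same kind of shared dictionary, might look like the code below. Everything about the padding is an assumption on my part: the target length max_len = 5, truncating longer lists, and leaving padded positions as all-zero vectors are illustrative choices, not something stated in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [['Other', 'Male', 'Female', 'Male', 'Other'],
                         ['Female', 'Other', 'Male']]})

# Shared mapping from category to integer code, built from all rows
codes = {k: n for n, k in enumerate(sorted({x for row in df['A'] for x in row}))}

max_len = 5              # assumed target length; longer lists are truncated
n_classes = len(codes)

# one_hot[i, j, c] is 1 when position j of row i holds category c;
# positions past the end of a short list stay all-zero (assumed padding)
one_hot = np.zeros((len(df), max_len, n_classes), dtype=int)
for i, row in enumerate(df['A']):
    for j, key in enumerate(row[:max_len]):
        one_hot[i, j, codes[key]] = 1
```

The result has shape (rows, max_len, number of categories). Because the mapping is shared, the same category lands in the same slot of the last axis for every row.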

      
