Encoding / factorizing lists in a pandas DataFrame
I am trying to encode lists of categories within a DataFrame by factorizing them. I will then create a matrix from this series of lists (normalizing them to a given length, creating a multidimensional array, and one-hot encoding the elements in the matrix).
However, the factor codes are not consistent between rows. This can be seen here:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [ ['Other', 'Male', 'Female', 'Male', 'Other'], ['Female', 'Other', 'Male'] ]})
>>> df['B'] = df.A.apply(lambda x: pd.factorize(x)[0])
>>> df
                                    A                B
0  [Other, Male, Female, Male, Other]  [0, 1, 2, 1, 0]
1               [Female, Other, Male]        [0, 1, 2]
Does anyone know how to maintain an encoding for this series that is the same across rows?
You can use LabelEncoder from sklearn:
Fit the encoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit([s for l in df.A for s in l])
Transform the column:
df.A.apply(le.transform)
#0 [2, 1, 0, 1, 2]
#1 [0, 2, 1]
#Name: A, dtype: object
le.classes_
#array(['Female', 'Male', 'Other'],
# dtype='<U6')
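Once every row shares the same codes, the padding and one-hot step mentioned in the question could look roughly like the sketch below. It uses only NumPy and starts from the encoded lists shown above. The names max_len and pad_code are hypothetical choices for illustration, not part of the question or of sklearn; padding positions are left as all-zero vectors.

```python
import numpy as np

# Encoded rows from the shared encoding (Female=0, Male=1, Other=2).
encoded = [[2, 1, 0, 1, 2], [0, 2, 1]]
n_classes = 3
max_len = 5     # hypothetical target length; shorter rows are padded
pad_code = -1   # hypothetical marker for padded positions

# Pad (or truncate) every row to max_len.
padded = np.full((len(encoded), max_len), pad_code, dtype=int)
for i, row in enumerate(encoded):
    padded[i, :len(row)] = row[:max_len]

# One-hot encode into shape (rows, max_len, n_classes).
# Padded positions are skipped, so they stay all-zero.
one_hot = np.zeros((len(encoded), max_len, n_classes), dtype=int)
rows, cols = np.nonzero(padded != pad_code)
one_hot[rows, cols, padded[rows, cols]] = 1

print(one_hot.shape)
```

Whether padding should be all-zero or get its own class depends on the model that consumes the matrix, so treat that part as an assumption.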
You can also build the encoding yourself using all the values in column A.
First, use a set comprehension to collect all unique items in column A. Then use a dictionary comprehension where the keys are those unique elements and the values are their positions in the sorted order of the unique elements.
Finally, look up each element in that dictionary using a list comprehension.
>>> s = set(item for sublist in df.A for item in sublist)
>>> s = {k: n for n, k in enumerate(sorted(s))}
>>> df.assign(B=[[s[key] for key in sublist] for sublist in df['A']])
                                    A                B
0  [Other, Male, Female, Male, Other]  [2, 1, 0, 1, 2]
1               [Female, Other, Male]        [0, 2, 1]
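A pandas-only variant of the same idea is sketched below, in case it is useful: flatten the lists with explode, call pd.factorize once on all values, then regroup by the original row label. This is an alternative to the dictionary approach, not what the answer above does, and it assumes a pandas version that provides Series.explode (0.25 or later). Empty lists would explode to NaN and receive the code -1, which this sketch does not handle.

```python
import pandas as pd

df = pd.DataFrame({'A': [['Other', 'Male', 'Female', 'Male', 'Other'],
                         ['Female', 'Other', 'Male']]})

# Flatten the lists into one long Series; the index keeps the original row label.
flat = df['A'].explode()

# Factorize once over all values so every row shares the same codes.
# sort=True assigns codes in sorted order of the unique labels.
codes, uniques = pd.factorize(flat, sort=True)

# Regroup the codes by original row label to rebuild one list per row.
df['B'] = pd.Series(codes, index=flat.index).groupby(level=0).agg(list)

print(df)
print(list(uniques))
```

Here uniques plays the same role as le.classes_ in the LabelEncoder answer: position i holds the label that was given code i.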