Python - binary encoding a column containing multiple terms

I need to binary-encode a column that contains lists of strings.

Can you help me get from this:

import pandas as pd

df = pd.DataFrame({'_id': [1,2,3],
                   'test': [['one', 'two', 'three'], 
                            ['three', 'one'], 
                            ['four', 'one']]})
df

_id  test
 1   [one, two, three]
 2   [three, one]
 3   [four, one]

      

to this:

df_result = pd.DataFrame({'_id': [1,2,3], 
                          'one': [1,1,1], 
                          'two': [1,0,0], 
                          'three': [1,1,0], 
                          'four': [0,0,1]})

df_result[['_id', 'one', 'two', 'three', 'four']]

 _id    one two  three  four
   1    1   1    1      0
   2    1   0    1      0
   3    1   0    0      1

      

Any help would be much appreciated!



2 answers


You can use str.get_dummies: pop extracts the column, str.join converts each list to a single string, and a final join attaches the result to the original DataFrame:

df = df.join(df.pop('test').str.join('|').str.get_dummies())
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
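
To make the chain above easier to follow, here is a rough step-by-step sketch of the same one-liner. The intermediate names (lists, joined, dummies, result) are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Step 1: pop removes 'test' from df and returns it as a Series of lists
lists = df.pop('test')

# Step 2: join each list into one '|'-separated string, e.g. 'one|two|three'
joined = lists.str.join('|')

# Step 3: str.get_dummies splits on '|' (its default separator)
# and builds one 0/1 indicator column per distinct term
dummies = joined.str.get_dummies()

# Step 4: attach the indicator columns back to the remaining columns by index
result = df.join(dummies)
print(result)
```

If a term could itself contain '|', a different separator would have to be passed to both str.join and str.get_dummies(sep=...).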

      

Instead of pop you can use drop:

df = df.drop('test', axis=1).join(df['test'].str.join('|').str.get_dummies())
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0

      



Solution with a new DataFrame:

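# dtype=int keeps 0/1 output on pandas versions where get_dummies defaults to bool
df1 = pd.get_dummies(pd.DataFrame(df.pop('test').values.tolist()), prefix='', prefix_sep='', dtype=int)
# merge duplicate column names via transpose (groupby(axis=1) is deprecated in recent pandas)
df = df.join(df1.T.groupby(level=0).max().T)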
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
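
As a sketch of why the groupby step is needed: get_dummies encodes each positional column separately, so a term that occurs in more than one list position yields duplicate column names, which then have to be merged. The names wide, raw and merged are illustrative. Passing dtype=int and grouping via a transpose are assumptions aimed at recent pandas versions, where get_dummies returns booleans by default and groupby(axis=1) is deprecated.

```python
import pandas as pd

lists = pd.Series([['one', 'two', 'three'],
                   ['three', 'one'],
                   ['four', 'one']])

# One column per list position; shorter lists are padded with missing values
wide = pd.DataFrame(lists.values.tolist())

# Each positional column is encoded on its own, so with an empty prefix
# the same term can appear as a column name more than once
raw = pd.get_dummies(wide, prefix='', prefix_sep='', dtype=int)
print(raw.columns.tolist())

# Merge columns that share a name by taking the maximum across them
merged = raw.T.groupby(level=0).max().T
print(merged)
```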

      

I also tried converting the lists to strings with astype, but the result needs some cleaning:

df1 = df.pop('test').astype(str).str.strip("'[]").str.replace(r"',\s+'", '|', regex=True).str.get_dummies()
df = df.join(df1)
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
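
Note that the variants above return the indicator columns in alphabetical order, while the question lists them as one, two, three, four. If the order matters, one possible approach is to reorder the columns by first appearance in the data. This is an assumption about what is wanted and is not part of the original answer; the name order is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Terms in order of first appearance; dict.fromkeys drops repeats
# while preserving insertion order
order = list(dict.fromkeys(term for terms in df['test'] for term in terms))

dummies = df['test'].str.join('|').str.get_dummies()

# Select the indicator columns in the desired order before joining
result = df[['_id']].join(dummies[order])
print(result)
```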

      



We can use the sklearn.preprocessing.MultiLabelBinarizer class:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('test')),
                          columns=mlb.classes_,
                          index=df.index))

      



Result:

In [15]: df
Out[15]:
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0

      
