Python - binary encoding a column containing multiple terms
I need to binary-encode a column containing lists of comma-separated strings.
Can you help me get from here:
import pandas as pd
df = pd.DataFrame({'_id': [1,2,3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})
df
_id test
1 [one, two, three]
2 [three, one]
3 [four, one]
to this:
df_result = pd.DataFrame({'_id': [1,2,3],
'one': [1,1,1],
'two': [1,0,0],
'three': [1,1,0],
'four': [0,0,1]})
df_result[['_id', 'one', 'two', 'three', 'four']]
_id one two three four
1 1 1 1 0
2 1 0 1 0
3 1 0 0 1
Any help would be much appreciated!
You can use str.get_dummies: pop extracts the column, str.join converts each list to a '|'-separated string, and a final join attaches the indicator columns to the rest of the frame:
df = df.join(df.pop('test').str.join('|').str.get_dummies())
print (df)
_id four one three two
0 1 0 1 1 1
1 2 0 1 1 0
2 3 1 1 0 0
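To see what each call in that chain contributes, here is a minimal sketch that runs the same steps one at a time. The intermediate names joined, dummies and result are only illustrative, and drop is used instead of pop so the original frame stays intact.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Step 1: turn each list into a single '|'-separated string
joined = df['test'].str.join('|')
print(joined.tolist())  # ['one|two|three', 'three|one', 'four|one']

# Step 2: str.get_dummies splits on '|' (its default separator)
# and builds one 0/1 column per distinct term, sorted alphabetically
dummies = joined.str.get_dummies()
print(dummies.columns.tolist())  # ['four', 'one', 'three', 'two']

# Step 3: attach the indicator columns to the remaining columns by index
result = df.drop(columns='test').join(dummies)
print(result)
```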
Instead of pop, you can use drop:
df = df.drop('test', axis=1).join(df['test'].str.join('|').str.get_dummies())
print (df)
_id four one three two
0 1 0 1 1 1
1 2 0 1 1 0
2 3 1 1 0 0
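The output above has its columns in alphabetical order, while the question lists them as one, two, three, four. If that order matters, one possible approach (a sketch, not the only way) is to record the order in which the terms first appear and use it to reorder the indicator columns. The names order, dummies and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# dict.fromkeys keeps the first occurrence of each term, in order
order = list(dict.fromkeys(term for terms in df['test'] for term in terms))
print(order)  # ['one', 'two', 'three', 'four']

dummies = df['test'].str.join('|').str.get_dummies()

# Select the indicator columns in first-appearance order before joining
result = df.drop(columns='test').join(dummies[order])
print(result)
```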
Solution with a new DataFrame:
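# dtype=int keeps 0/1 output on pandas versions where get_dummies defaults to bool
df1 = pd.get_dummies(pd.DataFrame(df.pop('test').values.tolist()), prefix='', prefix_sep='', dtype=int)
# groupby(axis=1) is deprecated in recent pandas, so group the transposed frame instead
df = df.join(df1.T.groupby(level=0).max().T)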
print (df)
_id four one three two
0 1 0 1 1 1
1 2 0 1 1 0
2 3 1 1 0 0
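The grouping step in this solution is needed because get_dummies encodes every list position separately, so a term that occurs in more than one position ends up as several columns with the same name. A minimal sketch of that intermediate state follows; the names wide, raw and collapsed are illustrative, and the transpose-based grouping is just one way to merge same-named columns.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# One column per list position; shorter lists are padded with missing values
wide = pd.DataFrame(df['test'].tolist(), index=df.index)

# Each position is encoded on its own, which produces duplicate column names
raw = pd.get_dummies(wide, prefix='', prefix_sep='', dtype=int)
print(raw.columns.tolist())

# Merge same-named columns: a term is present if any position contained it
collapsed = raw.T.groupby(level=0).max().T
print(collapsed)
```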
I also tried converting the lists to strings with astype, but the result needs some cleaning:
df1 = df.pop('test').astype(str).str.strip("'[]").str.replace(r"',\s+'", '|', regex=True).str.get_dummies()
df = df.join(df1)
print (df)
_id four one three two
0 1 0 1 1 1
1 2 0 1 1 0
2 3 1 1 0 0
We can use the sklearn.preprocessing.MultiLabelBinarizer class:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('test')),
columns=mlb.classes_,
index=df.index))
Result:
In [15]: df
Out[15]:
_id four one three two
0 1 0 1 1 1
1 2 0 1 1 0
2 3 1 1 0 0
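One practical reason to prefer a fitted encoder is that it can be reused, so rows that arrive later get the same columns in the same order. The sketch below assumes a hypothetical second batch called new_rows; the names train and encoded are also illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

train = pd.Series([['one', 'two', 'three'],
                   ['three', 'one'],
                   ['four', 'one']])

# Learn the set of terms once
mlb = MultiLabelBinarizer()
mlb.fit(train)
print(list(mlb.classes_))  # ['four', 'one', 'three', 'two']

# Hypothetical later batch, encoded with the already fitted encoder
new_rows = pd.Series([['two', 'four'], ['one']])
encoded = pd.DataFrame(mlb.transform(new_rows),
                       columns=mlb.classes_,
                       index=new_rows.index)
print(encoded)
```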