Column-wise factorize in Pandas for machine learning

Question

I am working on a machine learning evaluation dataset, and the dataset looks like this:

buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc

I want to convert these strings to unique enumerated integers, column by column. I see that pandas.factorize() is the way to go, but it only works on one column. How can I factorize the whole DataFrame at once with one command?

I tried apply with a lambda function, but it doesn't work:

df.apply(lambda c: pd.factorize(c), axis=1)

Output:

0    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4      ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...

I can see the encoded values, but I cannot pull them out of the output above.

python pandas scikit-learn machine-learning

pbu 27 Aug 14 at 14:56

1 answer

TomAugspurger · Accepted Answer · 2014-08-27T15:20:46+0000

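pd.factorize returns a tuple (codes, uniques): the integer codes and the distinct values they index into. You just need the codes in the DataFrame, so take element [0] and apply it column by column (the default axis=0) rather than row by row with axis=1.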

In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']

In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]: 
   buying  maint  lug_boot  safety  class
0       0      0         0       0      0
1       0      0         0       1      0
2       0      0         0       2      0
3       0      0         1       0      0
4       0      0         1       1      0
5       0      0         1       2      0

You can then treat these columns as numeric.
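
If you also need to map the integers back to the original strings, one option is to keep the second element of each factorize result as well. A minimal sketch, assuming only the six sample rows from the question; the names encoded, uniques and decoded_safety are illustrative and not part of pandas:

```python
import pandas as pd

# The six sample rows shown in the question.
df = pd.DataFrame({
    "buying": ["vhigh"] * 6,
    "maint": ["vhigh"] * 6,
    "doors": [2] * 6,
    "persons": [2] * 6,
    "lug_boot": ["small"] * 3 + ["med"] * 3,
    "safety": ["low", "med", "high"] * 2,
    "class": ["unacc"] * 6,
})
cols = ["buying", "maint", "lug_boot", "safety", "class"]

encoded = df.copy()
uniques = {}  # column name -> distinct values, positioned by integer code
for col in cols:
    codes, uniq = pd.factorize(df[col])
    encoded[col] = codes
    uniques[col] = uniq

# Decode a column again: code i maps back to uniques[col][i].
decoded_safety = uniques["safety"].take(encoded["safety"].to_numpy())
print(encoded)
print(list(decoded_safety))
```

Codes are assigned in order of first appearance, so they only happen to follow low < med < high here because the rows appear in that order.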

A word of warning though: integer codes imply that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies:

In [37]: dummies = []

In [38]: for col in cols:
   ....:     dummies.append(pd.get_dummies(df[col]))
   ....:     

In [39]: pd.concat(dummies, axis=1)
Out[39]: 
   vhigh  vhigh  med  small  high  low  med  unacc
0      1      1    0      1     0    1    0      1
1      1      1    0      1     0    0    1      1
2      1      1    0      1     1    0    0      1
3      1      1    1      0     0    1    0      1
4      1      1    1      0     0    0    1      1
5      1      1    1      0     1    0    0      1

get_dummies has some optional parameters for controlling the column names (for example prefix and prefix_sep) that you probably want.
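
As a rough sketch of that naming control: passing the whole DataFrame with columns= lets get_dummies encode several columns in one call, and by default it prefixes each indicator with the source column name. The frame below is rebuilt from the question's six sample rows; dtype=int is only there to get 0/1 rather than booleans on newer pandas versions.

```python
import pandas as pd

# The six sample rows shown in the question.
df = pd.DataFrame({
    "buying": ["vhigh"] * 6,
    "maint": ["vhigh"] * 6,
    "doors": [2] * 6,
    "persons": [2] * 6,
    "lug_boot": ["small"] * 3 + ["med"] * 3,
    "safety": ["low", "med", "high"] * 2,
    "class": ["unacc"] * 6,
})
cols = ["buying", "maint", "lug_boot", "safety", "class"]

# One call: each listed column is replaced by indicator columns named
# "<column>_<value>"; columns not listed (doors, persons) pass through.
dummies = pd.get_dummies(df, columns=cols, dtype=int)
print(dummies.columns.tolist())
```

With prefixed names, the two vhigh columns and the two med columns from the concatenated output no longer collide.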
