Create dummies from a column for a subset of data that does not contain all of the category values in that column

I am processing a subset of a large dataset.

The DataFrame has a column named "type". The "type" column is expected to have values such as [1,2,3,4].

In a certain subset, I find that the "type" column only contains some of these values, for example [1,4]:

    In [1]: df
    Out[1]:
       type
    0     1
    1     4

When I create dummies from the "type" column on this subset, I get the following:

    In [2]: import pandas as pd
    In [3]: pd.get_dummies(df["type"], prefix="type")
    Out[3]:
       type_1  type_4
    0       1       0
    1       0       1

The result has no "type_2" or "type_3" columns. What I need is this:

       type_1  type_2  type_3  type_4
    0       1       0       0       0
    1       0       0       0       1

Is there a solution for this?

3 answers


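Another solution with reindex and add_prefix (reindex_axis, used in older pandas, was removed in pandas 1.0):

# dtype=int keeps the 0/1 output on newer pandas, where get_dummies defaults to booleans
df1 = (pd.get_dummies(df["type"], dtype=int)
         .reindex(columns=[1, 2, 3, 4], fill_value=0)
         .add_prefix('type'))
print (df1)
   type1  type2  type3  type4
0      1      0      0      0
1      0      0      0      1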



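Or a categorical solution:

# newer pandas no longer accepts astype('category', categories=...); pass a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=[1, 2, 3, 4])
df1 = pd.get_dummies(df["type"].astype(cat_type), prefix='type', dtype=int)
print (df1)
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1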



What you need to do is turn the 'type' column into a pd.Categorical and specify the categories:



pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')

   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
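
Since the question is about processing subsets of a larger dataset, the same idea can be wrapped in a small helper and applied to each subset. This is only a sketch: the helper name `fixed_dummies` and the second sample subset are made up for illustration, and it assumes the full list of expected values is known in advance.

```python
import pandas as pd


def fixed_dummies(values, categories, prefix="type"):
    """One-hot encode `values` with one column per entry in `categories`.

    Hypothetical helper, not part of pandas.
    """
    cat = pd.Categorical(values, categories=categories)
    # dtype=int keeps 0/1 output; newer pandas defaults to booleans.
    return pd.get_dummies(cat, prefix=prefix, dtype=int)


all_types = [1, 2, 3, 4]  # assumed to be known up front

# Two subsets that each contain only some of the expected values.
subset_a = pd.DataFrame({"type": [1, 4]})
subset_b = pd.DataFrame({"type": [2, 2, 3]})

dummies_a = fixed_dummies(subset_a["type"], all_types)
dummies_b = fixed_dummies(subset_b["type"], all_types)

# Both results have the same four columns, whatever the subset contains.
print(list(dummies_a.columns))
print(list(dummies_b.columns))
```

Note that a value outside the given categories becomes NaN in the Categorical, so its row comes out as all zeros rather than raising an error.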

      



Since you tagged the post as one-hot-encoding, you may find OneHotEncoder from the sklearn module useful, besides a pure Pandas solution:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5

# one-hot encoding
# categories replaces the n_values argument, which newer scikit-learn no longer accepts;
# sparse_output is the scikit-learn >= 1.2 name of the older sparse flag
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)
data = encoder.fit_transform(df.type.values.reshape(-1,1))

# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])

print(newdf)

   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1

One of the benefits of this approach is that OneHotEncoder easily creates sparse vectors for very large sets of classes. (Just change to sparse_output=True in the OneHotEncoder() call.)
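
As a rough sketch of the sparse variant, the encoder can be left on its default output, which is a SciPy sparse matrix. This assumes a scikit-learn release that has the `categories` parameter (0.20 or later); the list `expected_types` is an illustrative name for the values you expect to see.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"type": [1, 4]})
expected_types = [1, 2, 3, 4]  # assumption: known in advance

# With default settings the encoder returns a SciPy sparse matrix,
# with one column per entry in expected_types.
encoder = OneHotEncoder(categories=[expected_types])
encoded = encoder.fit_transform(df[["type"]])

print(encoded.shape)
# Convert to a dense array only for inspection; keep it sparse for large data.
print(encoded.toarray())
```

Unlike the example above, this passes the expected values directly, so there is no extra type_0 column.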
