Create dummies from a column for a subset of data that does not contain all of the category values in that column
I am processing a subset of a large dataset.
The DataFrame has a column named "type". The "type" column is expected to have values such as [1,2,3,4].
In a certain subset, I find that the "type" column only contains some of those values, for example [1,4]:
In [1]: df
Out[2]:
   type
0     1
1     4
When I create dummies from the "type" column on this subset, I get the following:
In [3]: import pandas as pd
In [4]: pd.get_dummies(df["type"], prefix="type")
Out[5]:
   type_1  type_4
0       1       0
1       0       1
It has no columns named "type_2", "type_3". What I need is this:
Out[6]:
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
Is there a solution for this?
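Another solution uses reindex and add_prefix (the reindex_axis method from older pandas versions has since been removed):
# dtype=int keeps the 0/1 output; newer pandas versions default to bool
df1 = (pd.get_dummies(df["type"], dtype=int)
         .reindex(columns=[1, 2, 3, 4], fill_value=0)
         .add_prefix('type'))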
print (df1)
type1 type2 type3 type4
0 1 0 0 0
1 0 0 0 1
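One thing to keep in mind with the reindex approach is that a value outside the expected list has its column dropped without any warning. The following is a minimal sketch of checking for that first. The value 5 and the names `expected` and `unexpected` are illustrative assumptions, not part of the original question.

```python
import pandas as pd

expected = [1, 2, 3, 4]                      # full set of values from the question
df = pd.DataFrame({"type": [1, 4, 5]})       # 5 is an assumed out-of-range value

# Values present in the data but missing from the expected list.
unexpected = sorted(set(df["type"].unique().tolist()) - set(expected))

# reindex keeps only the expected columns and fills missing ones with 0;
# the column for 5 is dropped, so its row becomes all zeros.
dummies = (pd.get_dummies(df["type"], dtype=int)
             .reindex(columns=expected, fill_value=0)
             .add_prefix("type_"))

print(unexpected)
print(dummies)
```

Whether an unexpected value should raise an error or be encoded as an all-zero row depends on the use case. The sketch only makes the situation visible.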
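Or a categorical solution:
# astype('category', categories=...) is no longer accepted; pass a CategoricalDtype instead
df1 = pd.get_dummies(df["type"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4])), prefix='type')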
print (df1)
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
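Since the question is about processing many subsets of one large dataset, it may help to define the categorical dtype once and reuse it, so that every subset yields the same columns. This is a rough sketch under that assumption. The helper `type_dummies` and the second subset are hypothetical names and data introduced for illustration.

```python
import pandas as pd

# Assumed full set of levels for the "type" column (taken from the question).
ALL_TYPES = [1, 2, 3, 4]
type_dtype = pd.CategoricalDtype(categories=ALL_TYPES)


def type_dummies(subset):
    """Hypothetical helper: one-hot encode 'type' with a fixed set of columns."""
    # dtype=int gives 0/1 values regardless of the pandas default dtype.
    return pd.get_dummies(subset["type"].astype(type_dtype), prefix="type", dtype=int)


subset_a = pd.DataFrame({"type": [1, 4]})
subset_b = pd.DataFrame({"type": [2, 3, 3]})   # illustrative second subset

dummies_a = type_dummies(subset_a)
dummies_b = type_dummies(subset_b)

print(list(dummies_a.columns))
print(list(dummies_b.columns))
```

Both results carry the columns type_1 through type_4, so they can be concatenated or fed to the same model without realignment. Note that a value not listed in the categories becomes NaN after the cast and ends up as an all-zero row.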
What you need to do is make the column 'type' into a pd.Categorical and specify the categories:
pd.get_dummies(pd.Categorical(df.type, categories=[1, 2, 3, 4]), prefix='type')
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
Since you tagged the post as one-hot-encoding, you may find the sklearn class OneHotEncoder useful, besides a pure Pandas solution:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5
# one-hot encoding
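# n_values and sparse were removed from newer scikit-learn releases;
# categories and sparse_output (scikit-learn >= 1.2) replace them
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)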
data = encoder.fit_transform(df.type.values.reshape(-1,1))
# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])
print(newdf)
type_0 type_1 type_2 type_3 type_4
0 0 1 0 0 0
1 0 0 0 0 1
One of the benefits of this approach is that OneHotEncoder easily creates sparse vectors for very large sets of classes. (Just enable sparse output in the OneHotEncoder() declaration.)
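As a rough illustration of the sparse variant, the sketch below passes the expected values explicitly through the categories parameter, which also avoids the extra type_0 column. It assumes the full set of values is [1, 2, 3, 4], as stated in the question, and leaves the sparse option at its default so that the same code does not depend on how that parameter is named in a given scikit-learn release.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"type": [1, 4]})
levels = [1, 2, 3, 4]   # assumed full set of values, as stated in the question

# No sparse/sparse_output argument is passed, so the encoder's default
# (a SciPy sparse matrix) is used.
encoder = OneHotEncoder(categories=[levels], dtype=int)
encoded = encoder.fit_transform(df[["type"]])

# Convert to a dense frame only for display; keep `encoded` sparse for large data.
columns = ["type_{}".format(v) for v in levels]
dense = pd.DataFrame(encoded.toarray(), columns=columns)
print(dense)
```

With many classes, the sparse matrix stores only the non-zero entries, so it is usually better to pass `encoded` to downstream estimators directly than to call toarray().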