How can I use OneHotEncoder for multiple columns and automatically drop the first dummy variable for each column?
The dataset has three columns and three rows:
Name   Organization  Department
Manie  ABC2          FINANCE
Joyce  ABC1          HR
Ami    NSV2          HR
The code I have is below. It works so far, but how can I drop the first dummy variable column for each feature?
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()
import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']
                   })
df_2 = pd.get_dummies(df,drop_first=True)
test:
print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0
UPDATE regarding your error with pd.get_dummies(X, columns=[1:]): according to the documentation, the columns parameter takes "Column names". Thus, the following code will work:
df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)
output:
    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1
If you really want to define your columns positionally, you can do it like this:
column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
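A note on versions: in recent pandas releases (2.0 and later), get_dummies returns boolean columns by default, so the printed result shows True/False rather than 1/0. The sketch below assumes the same three-row frame as above and passes dtype=int, which should reproduce the 0/1 output shown in this answer. It also checks that selecting the columns by name and by position gives the same result.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# Select the columns to encode by name; dtype=int forces 0/1 instead of booleans.
by_name = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True, dtype=int)

# Select the same columns by position (everything after the first column).
by_position = pd.get_dummies(df, columns=df.columns[1:], drop_first=True, dtype=int)

print(by_name)
print(by_name.equals(by_position))
```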
I use my own template for this:
from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np

class DataFrameEncoder(TransformerMixin):
    def __init__(self):
        """Encode the data.
        Columns of dtype object are collected in a list. Dummies are
        created for those columns (dropping the first level of each),
        the original columns are removed, and the two DataFrames are
        concatenated again.
        """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data
        self.object_col = []
        for col in X.columns:
            if X[col].dtype == np.dtype('O'):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Drop the original object columns by name
        X = X.drop(columns=self.object_col)
        X = pd.concat([dummy_df, X], axis=1)
        return X
To use this code, save the template in the current directory under a file name such as customEncoder.py, then add this to your code:
from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)
All object-type columns are then removed, one-hot encoded with the first dummy column of each dropped, and concatenated back to give the desired result.
PS: note that the input to this template must be a pandas DataFrame.
In scikit-learn, this is pretty straightforward starting with version 0.21. You can use the drop option of OneHotEncoder to remove one of the categories for each feature. By default, nothing is dropped. Details can be found in the documentation.
from sklearn.preprocessing import OneHotEncoder
# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
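For illustration, here is a minimal sketch of that encoder applied to the Organization and Department columns from the question. It assumes scikit-learn 0.21 or later and types the three rows in by hand instead of reading Data1.csv. No LabelEncoder step is used, since OneHotEncoder accepts string categories directly in these versions.

```python
from sklearn.preprocessing import OneHotEncoder

# Organization and Department values from the question, one row per person
X = [['ABC2', 'FINANCE'],
     ['ABC1', 'HR'],
     ['NSV2', 'HR']]

# drop='first' removes the first category (in sorted order) of every column
ohe = OneHotEncoder(drop='first', handle_unknown='error')
encoded = ohe.fit_transform(X).toarray()

# categories_ lists the categories found per column; index 0 of each is dropped
print(ohe.categories_)
print(encoded)
```

The remaining columns correspond to ABC2, NSV2 and HR, because ABC1 and FINANCE sort first within their columns and are the ones dropped.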