How can I use OneHotEncoder for multiple columns and automatically drop the first dummy variable for each column?

Question

The dataset has three columns and three rows:

Name   Organization  Department
Manie  ABC2          FINANCE
Joyce  ABC1          HR
Ami    NSV2          HR

This is the code I have so far. It works, but how can I drop the first dummy variable column for each encoded feature?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values


# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()


python pandas scikit-learn machine-learning

Vijay Jun 17 '17 at 6:29 AM


4 answers

Encode the categorical variables one at a time. After each encoding, the new dummy columns should sit at the start of your array, so you can slice off the first one like this:

X = X[:, 1:]

Then encode the next variable and repeat.
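
As a rough illustration of this loop, here is a minimal NumPy-only sketch on the question's three rows. The helper name dummies_without_first is made up for this example, and np.unique/np.eye stand in for the LabelEncoder/OneHotEncoder pair, so treat it as a sketch of the idea rather than the exact method described above.

```python
import numpy as np

# The question's table as a 2-D array of strings.
X = np.array([
    ["Manie", "ABC2", "FINANCE"],
    ["Joyce", "ABC1", "HR"],
    ["Ami",   "NSV2", "HR"],
], dtype=object)


def dummies_without_first(column):
    """One-hot encode a 1-D array and drop the first (sorted) category."""
    # np.unique returns the sorted categories and an integer code per row.
    categories, codes = np.unique(column.astype(str), return_inverse=True)
    one_hot = np.eye(len(categories), dtype=int)[codes]
    return one_hot[:, 1:]  # same slicing idea as X = X[:, 1:]


# Encode each column in turn, then stack the results side by side.
encoded = np.hstack([dummies_without_first(X[:, i]) for i in range(X.shape[1])])
print(encoded)
```

The result has five columns: two for Name, two for Organization and one for Department, because each variable loses its first category.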


Roberto May 25 '18 at 11:24 PM


I use my own template for this:

from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np
class DataFrameEncoder(TransformerMixin):
    """One-hot encode every object-dtype column of a DataFrame.

    fit() records the object-dtype columns. transform() replaces them with
    dummy variables (first level dropped) and concatenates the result with
    the remaining columns.
    """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data.
        self.object_col = []
        for col in X.columns:
            if X[col].dtype == np.dtype('O'):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        # Encode the recorded columns, dropping the first dummy of each.
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Remove the original columns and put the dummies in front.
        X = X.drop(columns=self.object_col)
        X = pd.concat([dummy_df, X], axis=1)
        return X

To use this code, save the template in the current directory under a file name such as customEncoder.py, and then write:

from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)

All object-dtype columns are then encoded, the first dummy of each is dropped, and the result is combined with the remaining columns to give the final output.
PS: the input to this template must be a pandas DataFrame.
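
To try the same steps without creating a module, the sketch below runs the body of fit() and transform() directly on a small DataFrame. The Salary column is invented for illustration, dtype=int is added only so the dummies print as 0/1, and the column check uses pd.api.types.is_numeric_dtype instead of the object-dtype test (an assumption made so the example does not depend on how a given pandas version stores strings).

```python
import pandas as pd

# The question's table plus a made-up numeric column (Salary is hypothetical).
data = pd.DataFrame({
    "Name": ["Manie", "Joyce", "Ami"],
    "Organization": ["ABC2", "ABC1", "NSV2"],
    "Department": ["FINANCE", "HR", "HR"],
    "Salary": [100, 200, 300],
})

# Step 1 (what fit() does): collect the non-numeric columns.
object_cols = [col for col in data.columns
               if not pd.api.types.is_numeric_dtype(data[col])]

# Step 2 (what transform() does): encode them, dropping the first dummy of each.
dummy_df = pd.get_dummies(data[object_cols], drop_first=True, dtype=int)

# Step 3: remove the original columns and put the dummies in front.
encoded = pd.concat([dummy_df, data.drop(columns=object_cols)], axis=1)
print(encoded)
```

The numeric column passes through untouched, which is the main difference from encoding the whole array at once.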


MD Rijwan Jul 30 '19 at 18:25


In scikit-learn, this is straightforward starting with version 0.21. You can use the drop parameter of OneHotEncoder to remove one of the categories of each feature. By default, no category is dropped. Details can be found in the documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
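
A minimal sketch of that option on the question's data, assuming scikit-learn 0.21 or later. Since version 0.20 OneHotEncoder accepts string columns directly, so the LabelEncoder step from the question should not be needed. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The question's table as a 2-D array of strings.
X = np.array([
    ["Manie", "ABC2", "FINANCE"],
    ["Joyce", "ABC1", "HR"],
    ["Ami",   "NSV2", "HR"],
], dtype=object)

# drop='first' removes the first (sorted) category of every column.
# handle_unknown is left at its default, 'error'.
ohe = OneHotEncoder(drop="first")

# The default output is a sparse matrix, so convert it for printing.
X_encoded = ohe.fit_transform(X).toarray()

print(ohe.categories_)  # categories found in each column
print(X_encoded)
```

All three columns are encoded in one call, and each one contributes one column fewer than its number of categories.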


Jyoti prasad pal 27 Sep '19 at 5:33


Max Power · Accepted Answer · 2017-06-17T07:02:43+0000

import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']        
        })


df_2 = pd.get_dummies(df,drop_first=True)

test:

print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0

UPDATE regarding your error with pd.get_dummies(X, columns =[1:]):

According to the documentation, the columns parameter takes "Column names". Thus, the following code will work:

df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

output:

    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1

If you really want to define your columns positionally, you can do it like this:

column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
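
One caveat when reproducing the tables above: as far as I know, pandas 2.0 and later return boolean dummy columns by default, so the output would show True/False rather than 1/0. A minimal sketch of the positional variant that requests integers explicitly through the dtype parameter (the DataFrame is the same one used in this answer):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# Encode every column except the first one, dropping the first level of each.
column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot,
                      drop_first=True, dtype=int)  # dtype=int keeps 0/1 output
print(df_2)
```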
