One-hot encode categorical variables and scale continuous variables in a single step

I am confused because it becomes a problem if you first apply OneHotEncoder and then StandardScaler: the scaler will also scale the columns previously created by OneHotEncoder. Is there a way to do the encoding and the scaling at the same time and then merge the results together?
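For example, a naive chained approach on made-up toy data shows the dummy columns getting standardized too (a minimal sketch, assuming scikit-learn 0.20's dense OneHotEncoder output):

# Toy illustration of the pitfall (made-up data): scaling after
# one-hot encoding also standardizes the 0/1 dummy columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({'rank': [1, 2, 2, 3], 'gre': [500, 600, 650, 700]})
dummies = OneHotEncoder(sparse=False).fit_transform(toy[['rank']])
print(StandardScaler().fit_transform(dummies))  # no longer just 0s and 1s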

+9




4 answers


Sure. Just scale and one-hot encode the separate columns individually, as needed:



# Import libraries and download example data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
print(dataset.head(5))

# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale  = ['gre', 'gpa']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

# Scale and Encode Separate Columns
scaled_columns  = scaler.fit_transform(dataset[columns_to_scale])
encoded_columns = ohe.fit_transform(dataset[columns_to_encode])

# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
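If you want the result back as a labelled DataFrame, a small follow-up sketch (assuming scikit-learn >= 0.20, where OneHotEncoder has get_feature_names()):

# Optional: rebuild a labelled DataFrame from the NumPy output.
new_columns = columns_to_scale + list(ohe.get_feature_names(columns_to_encode))
processed_df = pd.DataFrame(processed_data, columns=new_columns)
print(processed_df.head())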

      

+14




Scikit-learn from version 0.20 provides sklearn.compose.ColumnTransformer for applying different transformers to different columns of a dataset. You can scale the numeric features and one-hot encode the categorical features in a single step. Below is the official example from the scikit-learn documentation:

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

      



Caution: as of scikit-learn 0.20, ColumnTransformer is EXPERIMENTAL; some behaviors may change between releases without a deprecation cycle.

+2




Currently, there is more than one way to achieve the result the OP is looking for. Three ways to do it are:

  1. np.concatenate()

    - see this answer to the OP's question, already posted above

  2. scikit-learn's

    ColumnTransformer

  3. scikit-learn's

    FeatureUnion

Using the example posted by @Max Power above, below is a minimal working snippet that does what the OP is looking for and concatenates the transformed columns into a single pandas DataFrame. The output of all three approaches is shown.

Common code for all 3 methods:

import numpy as np
import pandas as pd

# Import libraries and download example data
from sklearn.preprocessing import StandardScaler, OneHotEncoder

dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale  = ['gre', 'gpa']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

      

Method 1. See the full code in the first answer above. To show the output, you can use:

print(pd.DataFrame(processed_data).head())
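For completeness, the core of method 1 (repeated from the first answer) is:

# Scale and encode separate columns, then column-bind with NumPy.
scaled_columns  = scaler.fit_transform(dataset[columns_to_scale])
encoded_columns = ohe.fit_transform(dataset[columns_to_encode])
processed_data  = np.concatenate([scaled_columns, encoded_columns], axis=1)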

      

The result of method 1.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

      

Method 2.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


p = Pipeline([
    ('coltransformer', ColumnTransformer(transformers=[
        ('assessments', Pipeline([('scale', scaler)]), columns_to_scale),
        ('ranks', Pipeline([('encode', ohe)]), columns_to_encode),
    ])),
])

print(pd.DataFrame(p.fit_transform(dataset)).head())
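A design note: since each nested Pipeline above has only a single step, the scaler and encoder can also be passed to ColumnTransformer directly. A shorter equivalent sketch:

# Equivalent, without the single-step nested Pipelines.
preprocessor = ColumnTransformer(transformers=[
    ('assessments', scaler, columns_to_scale),
    ('ranks', ohe, columns_to_encode),
])
print(pd.DataFrame(preprocessor.fit_transform(dataset)).head())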

      

The result of method 2.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

      

Method 3.

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a pandas DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Stateless: nothing to learn during fit.
        return self

    def transform(self, df):
        # Return only the requested column(s).
        return df[self.key]


p = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('assessments', Pipeline([
            ('selector', ItemSelector(key=columns_to_scale)),
            ('scale', scaler),
        ])),
        ('ranks', Pipeline([
            ('selector', ItemSelector(key=columns_to_encode)),
            ('encode', ohe),
        ])),
    ])),
])

print(pd.DataFrame(p.fit_transform(dataset)).head())

      

The result of method 3.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

      

Explanation

  1. Method 1 has already been explained.

  2. Methods 2 and 3 accept the complete dataset but apply specific transformations only to subsets of the data. The modified/processed subsets are then merged (column-bound) into the final output, as the quick sanity check below illustrates.
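The merged output keeps the transformer order defined above: scaled columns first, then the one-hot columns. A quick check (using the pipeline p from either method 2 or 3 and the column lists defined above):

# Sanity check: 2 scaled columns plus one dummy column per distinct rank.
out = pd.DataFrame(p.fit_transform(dataset))
assert out.shape[1] == len(columns_to_scale) + dataset['rank'].nunique()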

Versions used

pandas==0.23.4
numpy==1.15.2
scikit-learn==0.20.0

      

Additional notes

The three methods shown here are probably not the only possibilities; I'm sure there are other ways to do this.

Sources used

Updated dataset link: https://stats.idre.ucla.edu/stat/data/binary.csv

+2




I can't see your problem: OneHotEncoder is used for nominal (categorical) data, while StandardScaler is used for numerical data. Therefore, you shouldn't apply both of them to the same columns of your data.

0








