One-hot encode categorical variables and scale continuous variables in a single step
I am confused because it's a problem if you first apply OneHotEncoder and then StandardScaler, since the scaler will also scale the columns previously produced by OneHotEncoder. Is there a way to do the encoding and the scaling at the same time and then merge the results together?
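For illustration, a toy sketch of the issue (hypothetical data, not from the question): running StandardScaler over a matrix that already contains one-hot columns standardizes the 0/1 dummies along with everything else:

import numpy as np
from sklearn.preprocessing import StandardScaler

# First two columns are one-hot dummies, third is a numeric feature.
X = np.array([[1., 0., 20.],
              [0., 1., 30.],
              [1., 0., 40.]])
print(StandardScaler().fit_transform(X))
# The dummy columns come out as values like 0.707 / -1.414, no longer 0/1.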
Sure. Just scale and one-hot encode the individual columns separately, as needed:
# Import libraries and download example data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
print(dataset.head(5))
# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale = ['gre', 'gpa']
# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
# Scale and Encode Separate Columns
scaled_columns = scaler.fit_transform(dataset[columns_to_scale])
encoded_columns = ohe.fit_transform(dataset[columns_to_encode])
# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
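If you would rather get a labelled DataFrame back instead of a bare NumPy array, a minimal sketch (assuming scikit-learn >= 0.20, where OneHotEncoder.get_feature_names is available):

# Rebuild a DataFrame with readable column names (sketch)
encoded_names = ohe.get_feature_names(columns_to_encode)  # e.g. 'rank_1', 'rank_2', ...
processed_df = pd.DataFrame(processed_data,
                            columns=columns_to_scale + list(encoded_names))
print(processed_df.head())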
Scikit-learn from version 0.20 provides sklearn.compose.ColumnTransformer for building a transformer over mixed column types. You can scale the numeric features and one-hot encode the categorical features together. Below is the official example (you can find the code here):
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
np.random.seed(0)
# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
X = data.drop('survived', axis=1)
y = data['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
Caution: this feature is EXPERIMENTAL; some behaviors may change between releases without being deprecated.
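The snippet above imports GridSearchCV but does not use it; the full official example also grid-searches over the whole pipeline. A minimal sketch (parameter names are composed from the step names defined above; the grid values are illustrative):

# Search over both a preprocessing parameter and the classifier's C.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10],
}
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("best params:", grid_search.best_params_)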
Currently there are several ways to achieve the result the OP asked for. Three ways to do it:

1. np.concatenate() - see this answer to the OP's question, already posted above
2. scikit-learn's ColumnTransformer - originally suggested in this SO answer to the OP's question
3. scikit-learn's FeatureUnion - also shown in this SO answer

Using the example posted by @Max Power here, below is a minimal working snippet that does what the OP is looking for and brings the transformed columns together into a single Pandas dataframe. The output of all 3 approaches is shown.
Common code for all 3 methods:
import numpy as np
import pandas as pd
# Import libraries and download example data
from sklearn.preprocessing import StandardScaler, OneHotEncoder
dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale = ['gre', 'gpa']
# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
Method 1. See the code in the answer above. To show the output, you can use:
print(pd.DataFrame(processed_data).head())
The result of method 1.
0 1 2 3 4 5
0 -1.800263 0.579072 0.0 0.0 1.0 0.0
1 0.626668 0.736929 0.0 0.0 1.0 0.0
2 1.840134 1.605143 1.0 0.0 0.0 0.0
3 0.453316 -0.525927 0.0 0.0 0.0 1.0
4 -0.586797 -1.209974 0.0 0.0 0.0 1.0
Method 2.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
p = Pipeline([
    ('coltransformer', ColumnTransformer(transformers=[
        ('assessments', Pipeline([
            ('scale', scaler),
        ]), columns_to_scale),
        ('ranks', Pipeline([
            ('encode', ohe),
        ]), columns_to_encode),
    ])),
])
print(pd.DataFrame(p.fit_transform(dataset)).head())
The result of method 2.
0 1 2 3 4 5
0 -1.800263 0.579072 0.0 0.0 1.0 0.0
1 0.626668 0.736929 0.0 0.0 1.0 0.0
2 1.840134 1.605143 1.0 0.0 0.0 0.0
3 0.453316 -0.525927 0.0 0.0 0.0 1.0
4 -0.586797 -1.209974 0.0 0.0 0.0 1.0
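A note on Method 2: by default, ColumnTransformer drops any column not listed in its transformers (here the admit column). If you want to keep such columns untouched, a sketch using the remainder parameter (available since scikit-learn 0.20):

ct = ColumnTransformer(
    transformers=[
        ('scale', scaler, columns_to_scale),
        ('encode', ohe, columns_to_encode),
    ],
    remainder='passthrough')  # pass the remaining 'admit' column through unchanged
print(pd.DataFrame(ct.fit_transform(dataset)).head())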
Method 3.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select the given column(s) from a DataFrame."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        return df[self.key]
p = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('assessments', Pipeline([
                ('selector', ItemSelector(key=columns_to_scale)),
                ('scale', scaler),
            ])),
            ('ranks', Pipeline([
                ('selector', ItemSelector(key=columns_to_encode)),
                ('encode', ohe),
            ])),
        ]
    )),
])
print(pd.DataFrame(p.fit_transform(dataset)).head())
The result of method 3.
0 1 2 3 4 5
0 -1.800263 0.579072 0.0 0.0 1.0 0.0
1 0.626668 0.736929 0.0 0.0 1.0 0.0
2 1.840134 1.605143 1.0 0.0 0.0 0.0
3 0.453316 -0.525927 0.0 0.0 0.0 1.0
4 -0.586797 -1.209974 0.0 0.0 0.0 1.0
Explanation
Method 1 has already been explained above.
Methods 2 and 3 accept the complete dataset but apply each transformation only to the specified subset of columns. The transformed subsets are then merged (column-bound) into the final output.
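Because Methods 2 and 3 are ordinary scikit-learn pipelines, the fitted transformer can be reused on held-out data. A minimal sketch, assuming the Method 2 pipeline p from above:

from sklearn.model_selection import train_test_split

train, test = train_test_split(dataset, test_size=0.2, random_state=0)
p.fit(train)                   # learn scaling/encoding parameters on the train split only
print(p.transform(test)[:5])   # apply the same parameters to the test split

If the test split might contain categories not seen during fit, construct the encoder with OneHotEncoder(handle_unknown='ignore').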
More details
Versions used:
pandas==0.23.4
numpy==1.15.2
scikit-learn==0.20.0
Additional Notes
The 3 methods shown here are probably not the only possibilities; I'm sure there are other ways to do this.