Isolation Forest

Question

I am currently working on identifying outliers in my dataset using the IsolationForest method in Python, but I don't quite understand the sklearn example:

http://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py

Specifically, what does the graph actually show? Observations have already been defined as normal / outliers, so my guess is that the shading of the contour plot indicates whether an observation is indeed an outlier (e.g., do the observations with higher anomaly scores fall in the darker shaded areas?).

Finally, how is the following section of code used (specifically the y_pred_* variables)?

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

I'm guessing this was just provided for completeness in case anyone wants to print the output?

Thanks in advance for your help!

python scikit-learn anomaly-detection outliers

bosbraves 06 jul. 17 at 14:20

1 answer

seralouk · Accepted Answer · 2017-07-06T14:39:14+0000

Code Usage

After your code, just print y_pred_outliers:

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers) 

print(y_pred_outliers)

For each observation, predict returns +1 if it is considered an inlier and -1 if it should be treated as an outlier according to the fitted model.
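To see how the ±1 labels relate to the underlying anomaly score (and, as far as I can tell, to the shading in the linked plot, which appears to be drawn from decision_function values evaluated on a grid), here is a minimal sketch. The toy training data, the X_new points, and the grid bounds are made up for illustration. It also assumes a recent scikit-learn version, where predict returns -1 exactly when decision_function is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Hypothetical toy data: one tight 2-D cluster around the origin.
X_train = 0.3 * rng.randn(200, 2)

clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)

# Two illustrative points: one inside the cluster, one far away from it.
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])

labels = clf.predict(X_new)            # +1 = inlier, -1 = outlier
scores = clf.decision_function(X_new)  # lower score = more anomalous

for point, label, score in zip(X_new, labels, scores):
    print(point, label, round(float(score), 3))

# The same scores evaluated on a grid are what a contour plot would shade.
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

Passing xx, yy, Z to a filled contour plot then gives one shade per score level. With a reversed colormap such as Blues_r, the lowest (most anomalous) scores would land in the darkest regions.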

Simple example using Iris data

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
data = load_iris()

X = data.data
y = data.target
# Artificial outliers: uniform noise with the same shape as the Iris data
X_outliers = rng.uniform(low=-4, high=4, size=(X.shape[0], X.shape[1]))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the model on the training split only
clf = IsolationForest()
clf.fit(X_train)

# predict returns +1 for inliers and -1 for outliers
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

print(y_pred_test)
print(y_pred_outliers)

Result:

[ 1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]

[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

Interpretation:

print(y_pred_test) returns 1 for every sample except one. This means that almost all X_test samples are classified as inliers, i.e. not outliers.

On the other hand, print(y_pred_outliers) returns only -1. This means that all samples in X_outliers (150 in total, the same size as the Iris data) are classified as outliers.
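Rather than reading the printed arrays by eye, the labels can be counted. The following is a minimal sketch along the lines of the Iris example; count_labels is a hypothetical helper introduced here for illustration, and the exact counts depend on the random seed and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = load_iris().data
# Artificial outliers: uniform noise with the same shape as the Iris data.
X_outliers = rng.uniform(low=-4, high=4, size=X.shape)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

clf = IsolationForest(random_state=rng).fit(X_train)


def count_labels(y_pred):
    """Hypothetical helper: return (n_inliers, n_outliers) for a predict() result."""
    y_pred = np.asarray(y_pred)
    return int(np.sum(y_pred == 1)), int(np.sum(y_pred == -1))


test_inliers, test_outliers = count_labels(clf.predict(X_test))
noise_inliers, noise_outliers = count_labels(clf.predict(X_outliers))

print("X_test:     inliers =", test_inliers, "outliers =", test_outliers)
print("X_outliers: inliers =", noise_inliers, "outliers =", noise_outliers)
```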

Hope it helps
