Extract misclassified documents with scikit-learn
I am curious whether there are built-in functions in the scikit-learn Python module that can fetch misclassified documents.
It's simple, and I usually write it myself, comparing the predicted and test vectors and extracting the documents from the array of test documents. But I am asking whether there is built-in functionality for it, so that I don't have to copy this code into every script I write.
If you have a list of true labels y_test for a set of documents, e.g. ["ham", "spam", "spam", "ham"], and you convert it to a NumPy array, then you can compare it to the predictions in one line:
import numpy as np
y_test = np.asarray(y_test)
# np.where returns a tuple with one array per dimension; [0] takes the index array
misclassified = np.where(y_test != clf.predict(X_test))[0]
Now misclassified is an array of indices into X_test.
@eickenberg is right, this kind of thing is not implemented in scikit-learn because users are expected to be familiar enough with NumPy to do it themselves in a few lines of code.
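As a small illustration of the comparison above, here is a minimal sketch that goes one step further and pulls the misclassified documents themselves out of the test set. The documents, the labels and the helper name misclassified_indices are made up for the example, and y_pred is a hard-coded stand-in for clf.predict(X_test), so nothing here is a scikit-learn API.

```python
import numpy as np

# Hypothetical test documents and their true labels.
docs = np.array(["cheap pills now", "meeting at noon",
                 "win a prize", "lunch tomorrow?"])
y_test = np.array(["spam", "ham", "spam", "ham"])

# Stand-in for clf.predict(X_test); hard-coded so the sketch runs on its own.
y_pred = np.array(["spam", "spam", "ham", "ham"])


def misclassified_indices(y_true, y_pred):
    """Return the positions where the prediction differs from the true label."""
    return np.flatnonzero(np.asarray(y_true) != np.asarray(y_pred))


idx = misclassified_indices(y_test, y_pred)

# Index the document array with those positions to get the documents back.
wrong_docs = docs[idx]
```

The same index array can be used on X_test, y_test and the predictions, which makes it easy to inspect each misclassified document next to its true and predicted label.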
You can get the misclassified samples with a list comprehension like this. Otherwise, I don't know of any other way to do it in sklearn.
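from sklearn import datasets
from sklearn import svm
# train_test_split lives in model_selection; sklearn.cross_validation was removed in 0.20
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
# Keep only the first two features
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = svm.LinearSVC()
clf.fit(X_train, y_train)
# Collect the test samples whose predicted label differs from the true label.
# predict() expects a 2-D array, so each sample is reshaped to (1, n_features).
mis_cls = [sample
           for sample, truth in zip(X_test, y_test)
           if clf.predict(sample.reshape(1, -1))[0] != truth]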