Extract misclassified documents with scikit-learn

I am curious whether the scikit-learn Python module has a built-in function that can retrieve misclassified documents.

It's simple, and I usually write it myself: compare the predicted and true label vectors, then pull the corresponding documents out of the array of test documents. But I am asking whether there is built-in functionality for this, so that I don't have to copy the same function into every piece of code I write.

+3




2 answers


If you have a list of true labels y_test for a set of documents, e.g. ["ham", "spam", "spam", "ham"], and you convert it to a NumPy array, then you can compare it to the predictions in one line:

import numpy as np

y_test = np.asarray(y_test)
# np.where returns a tuple of arrays; [0] takes the index array itself
misclassified = np.where(y_test != clf.predict(X_test))[0]

      



Now misclassified is an array of indices into X_test.
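To get from those indices back to the documents themselves, you can index the array of test documents with them. Here is a minimal sketch of that step; docs_test and the hard-coded y_pred are made-up stand-ins (y_pred plays the role of clf.predict(X_test)), so it runs without a trained model:

```python
import numpy as np

# Hypothetical test documents and their true labels.
docs_test = np.array(["cheap pills now", "meeting at noon",
                      "win a prize", "lunch tomorrow?"])
y_test = np.array(["spam", "ham", "spam", "ham"])

# Stand-in for clf.predict(X_test); hard-coded so the sketch needs no model.
y_pred = np.array(["spam", "spam", "ham", "ham"])

# Indices of the test samples whose prediction differs from the true label.
misclassified = np.where(y_test != y_pred)[0]

# Integer-array indexing pulls out the matching documents and labels.
for doc, truth, pred in zip(docs_test[misclassified],
                            y_test[misclassified],
                            y_pred[misclassified]):
    print(f"{doc!r}: true={truth}, predicted={pred}")
```

This only works as long as docs_test is in the same order as the rows of X_test, which is the case if both came out of the same split.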

@eickenberg is right, this kind of thing is not implemented in scikit-learn because users are expected to be familiar enough with NumPy to do it themselves in a few lines of code.
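If the goal is just to stop copying those few lines around, one option is to keep them in a small helper of your own. The sketch below is only an illustration: get_misclassified is a hypothetical name, not part of scikit-learn, and it assumes the documents are plain strings in the same order as the labels.

```python
import numpy as np


def get_misclassified(docs, y_true, y_pred):
    """Return (indices, documents) for samples where y_pred != y_true.

    Hypothetical helper, not a scikit-learn function.
    """
    docs = np.asarray(docs, dtype=object)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if not (len(docs) == len(y_true) == len(y_pred)):
        raise ValueError("docs, y_true and y_pred must have the same length")
    # Positions where the predicted label disagrees with the true label.
    idx = np.where(y_true != y_pred)[0]
    return idx, docs[idx].tolist()


# Usage with made-up data; y_pred would normally be clf.predict(X_test).
idx, bad_docs = get_misclassified(
    ["first doc", "second doc", "third doc"],
    y_true=[0, 1, 1],
    y_pred=[0, 0, 1],
)
```

Taking the predictions as an argument, rather than the classifier, keeps the helper independent of how the predictions were produced.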

+7




You can get the misclassified samples with a list comprehension, as in the code below. Beyond that, I don't know of any built-in way to do it in sklearn.



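# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split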
from sklearn import datasets
from sklearn import svm


iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = svm.LinearSVC()
clf.fit(X_train, y_train)

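# Predict once for the whole test set, then keep the test samples
# whose predicted label differs from the true label.
y_pred = clf.predict(X_test)
mis_cls = [sample
           for sample, truth, pred in
           zip(X_test, y_test, y_pred)
           if pred != truth]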

      

0

