Can't get NLTK-Trainer to recognize / work with scikit-learn classifiers

Question

I used the (excellent) NLTK-Trainer to train the NaiveBayes classifier to classify chunks of text. I see that NLTK-Trainer also supports scikit-learn algorithms and I would like to use them in the hopes of reducing memory usage / improving accuracy.

However, when I try to specify one of the scikit-learn classifiers when running train_classifier.py, it throws an error:

train_classifier.py: error: argument --classifier/--algorithm: invalid choice: 'sklearn.BernoulliNB' (choose from 'NaiveBayes', 'DecisionTree', 'Maxent', 'GIS', 'IIS', 'MEGAM', 'TADM')

I am running 32-bit Anaconda distribution (2.20) Python 3.4.3 on Windows 7. "pip freeze" gives me this: NLTK 3.0.4, scikit-learn 0.16.1. I believe I am using the latest version of NLTK-Trainer (I downloaded it a month ago).

After some research, I have two theories about what is going wrong. The first is that there is an error in argument parsing, so that the value in --classifier sklearn.BernoulliNB is not passed correctly to train_classifier.py. A trace on the error gives me this:

nltk_data\nltk-trainer-master\nltk-trainer-master\train_classifier.py in <module>()
    131 nltk_trainer.classification.args.add_sklearn_args(parser)
    132
--> 133 args = parser.parse_args()
    134

AppData\Local\Continuum\Anaconda3\lib\argparse.py in parse_args(self, args, namespace)
   1726 # =====================================
   1727 def parse_args(self, args=None, namespace=None):
-> 1728 args, argv = self.parse_known_args(args, namespace)
   1729 if argv:
   1730 msg = _('unrecognized arguments: %s')

   1765 except ArgumentError:
   1766 err = _sys.exc_info()[1]
-> 1767 self.error(str(err))
   1768
   1769 def _parse_known_args(self, arg_strings, namespace):

My other hypothesis is that the scikit-learn files included with Anaconda are somewhere NLTK-Trainer cannot find them. Per Jacob Perkins' comment, I can run the command 'from nltk.classify import scikitlearn' without errors. However, when I look further into the nltk-trainer code in args.py, I cannot run the code that follows the import command. All of these lines cause errors:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn import ensemble, feature_selection, linear_model, naive_bayes, neighbors, svm, tree

This is very frustrating, and I can't tell why it doesn't work. Any help would be much appreciated!

python nltk argparse

cars0245 05 Aug 15 at 14:58

1 answer

hpaulj · Answer 1 · 2015-08-05T15:32:10+0000

argparse is just code that takes your command-line arguments and parses them. It does not use or act on those arguments; that is done by the code that runs after parsing. The parser is just a gatekeeper, making sure your inputs look correct.

I'm not familiar with NLTK-Trainer, but I can see what its parser does.

It is clear from the error message that your "sklearn.BernoulliNB" argument is reaching the parser. But the --classifier argument was configured to accept only one of the strings in its choices list, ['NaiveBayes', 'DecisionTree',...]. It does not accept any other name or module reference.
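To illustrate the gatekeeping behavior, here is a minimal sketch of a parser that rejects anything outside its choices list. This is not NLTK-Trainer's actual code: the parser below is a hypothetical stand-in, with the option names and choices copied from the error message in the question, and try_parse is a helper invented for this example.

```python
import argparse

# Hypothetical stand-in for the real parser. The option names and the
# choices list are copied from the error message in the question.
parser = argparse.ArgumentParser(prog="train_classifier.py")
parser.add_argument(
    "--classifier", "--algorithm",
    choices=["NaiveBayes", "DecisionTree", "Maxent", "GIS", "IIS", "MEGAM", "TADM"],
    default="NaiveBayes",
)


def try_parse(argv):
    """Return the parsed classifier name, or None if argparse rejects it."""
    try:
        return parser.parse_args(argv).classifier
    except SystemExit:
        # parser.error() prints the "invalid choice" message and then exits.
        return None


accepted = try_parse(["--classifier", "NaiveBayes"])
rejected = try_parse(["--classifier", "sklearn.BernoulliNB"])
print(accepted, rejected)
```

The second call fails inside parse_args() itself, before any program logic runs, which matches the traceback in the question.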

The program is likely to take the accepted name and map it to some other function, module, or parameter.

Try calling this code with -h or --help to see what arguments it accepts, and check the program's documentation to see what it says about input. Perhaps there is another way to specify alternative algorithms; --classifier is explicitly configured to accept only a predefined set of values.
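As a rough sketch of what the help output reveals, argparse lists the allowed values of an option next to its name. The parser here is again a hypothetical stand-in with a shortened choices list, not the real train_classifier.py parser.

```python
import argparse

# Hypothetical parser with a shortened choices list, for illustration only.
parser = argparse.ArgumentParser(prog="train_classifier.py")
parser.add_argument(
    "--classifier", "--algorithm",
    choices=["NaiveBayes", "DecisionTree", "Maxent"],
    help="which classification algorithm to train",
)

# format_help() returns the same text that -h / --help would print.
help_text = parser.format_help()
print(help_text)
```

If the scikit-learn names do not appear in the list that --help prints on your machine, the parser was built without them, and the place to look is the code that assembles the choices rather than parse_args().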
