Performing a grid search on sklearn.naive_bayes.MultinomialNB on a multicore machine does not use all available CPU resources

I am currently trying to create some text classification tools using Python and Scikit-learn.

My text is not in English, so it cannot benefit from the usual stemming or other dimensionality-reduction steps that are available for English.

As a result, the TfIdf matrix becomes quite large (150,000 x 150,000). It can be processed on a regular PC, but running a grid search on it would be too much, so I enlisted the help of Amazon Web Services to run the grid search. (My parameter set is pretty big, too.)

Here is my code:

    # coding: utf-8
    import os, json, codecs, nltk
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
    from sklearn.grid_search import GridSearchCV
    from time import time
    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB

    print("Importing dataset...")
    with open('y_data.json', 'r') as fp:
        y = json.load(fp)
    with open('dataset.json', 'r') as fp:
        dataset = json.load(fp)

    print("Importing stop words...")
    stopword = []
    with codecs.open('stopword.txt', 'r', 'utf-8') as fp:
        for w in fp:
            stopword.append(w.strip())
    light_st = set(stopword)
    with codecs.open('st_data.txt', 'r', 'cp874') as fp:
        for w in fp:
            stopword.append(w.strip())
    heavy_st = set(stopword)

    def pre_process_1(text):
        return text.replace("|", " ")

    def tokenize_1(text):
        return text.split()

    pipeline = Pipeline([
        ('vec', CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                                tokenizer=tokenize_1, stop_words=heavy_st,
                                token_pattern=None)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB())
    ])

    parameters = {
        'vec__max_df': (0.5, 0.625, 0.75, 0.875, 1.0),
        'vec__max_features': (None, 5000, 10000, 20000),
        'vec__min_df': (1, 5, 10, 20, 50),
        'tfidf__use_idf': (True, False),
        'tfidf__sublinear_tf': (True, False),
        'vec__binary': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001)
    }

    if __name__ == "__main__":
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2)
        t0 = time()
        grid_search.fit(dataset, y)
        print("done in {0}s".format(time() - t0))
        print("Best score: {0}".format(grid_search.best_score_))
        print("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(list(parameters.keys())):
            print("\t{0}: {1}".format(param_name, best_parameters[param_name]))

      

And here are the details of my programming environment:

  • Python 3.4.2

  • scikit-learn 0.15.2 (installed with pip)

  • Ubuntu Server 14.04 LTS 64-bit (using HVM)

  • Tried on an EC2 r3.8xlarge instance

At first I ran my model on a much smaller instance (r3.2xlarge; 8 cores), but I figured from the calculation that it would take quite a long time (about 2 days). So I decided to scale up to the largest instance (I use the r3 family because my script is quite memory intensive); however, it did not run as fast as I expected.

When I monitored the CPU load (watch -n 5 uptime), I found that the load average never exceeds 9, even after leaving it running for some time. (From what I understand, a 32-core machine fully utilizing all of its cores should show a load average of around 32.)

I tried changing n_jobs to different numbers (8, 32, 128) with the same result. (However, I think the script does try to launch as many jobs as instructed, because when I terminate the process I see lines like "Process ForkPoolWorker-30:" and their tracebacks scroll past the screen.)

Further checking with ps x -C python3.4 shows that only 8 python processes are running. I figured it might be some kind of limitation from Python or the OS (I built my AMI on a t2.micro instance, which doesn't have many cores), so I decided to rebuild my environment from scratch on a c3.4xlarge instance, including recompiling Python, and to switch the OS to Amazon Linux (a Fedora derivative, I think) for better hardware compatibility.

However, my script still never used more than 8 cores. Finally, I tried the text classification example from the scikit-learn website: http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html (which uses SGDClassifier instead of MultinomialNB). It runs fine on all thirty-two cores!

So ... maybe this has something to do with how the grid search interacts with the Naive Bayes classifier?

I am considering filing a bug, but would first like to know whether this is expected behavior of Naive Bayes or whether I am doing something wrong in my code.

Update

I cannot find a way to verify directly whether memory bandwidth is at fault. But I tried timing my parallel code and measuring CPU usage in various ways to pinpoint where the bottleneck occurs.

Experiment 1: Perform vectorization and transformation only.

Using my real data as input (150,000 text documents, each with about 130 words) and a parameter space of about 400 combinations, with the parallelism handled by joblib (the same module scikit-learn uses), I got:
Using 8 threads: done in 841.017783164978 s and using 24.636999999999993% of the CPU.
Using 16 threads: done in 842.9525656700134 s and using 24.700749999999985% of the CPU.
Using all 32 threads: done in 857.024197101593 s and using 24.242250000000013% of the CPU.

The results clearly indicate that the vectorization process does not scale with additional processing power.
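For reference, here is a rough sketch of the kind of timing harness I mean, not my exact script: the parameter subset is just a stand-in for the real ~400 combinations, and sampling CPU utilization with psutil is only one way to measure it. It reuses pre_process_1, tokenize_1, heavy_st and dataset from the code above.

    # Sketch of experiment 1: one vectorization + transformation per parameter
    # combination, run in parallel with joblib; CPU utilization sampled with psutil.
    import psutil
    from time import time
    from sklearn.externals.joblib import Parallel, delayed
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    def vectorize_once(docs, max_df, min_df):
        vec = CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                              tokenizer=tokenize_1, stop_words=heavy_st,
                              token_pattern=None, max_df=max_df, min_df=min_df)
        return TfidfTransformer().fit_transform(vec.fit_transform(docs))

    params = [(max_df, min_df)
              for max_df in (0.5, 0.625, 0.75, 0.875, 1.0)
              for min_df in (1, 5, 10, 20, 50)]      # stand-in for the real parameter space

    psutil.cpu_percent(interval=None)                # reset the system-wide CPU counter
    t0 = time()
    Parallel(n_jobs=8, verbose=1)(
        delayed(vectorize_once)(dataset, max_df, min_df) for max_df, min_df in params)
    print("done in {0} s and using {1}% of the CPU".format(
        time() - t0, psutil.cpu_percent(interval=None)))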

Experiment 2: This time, I run only MultinomialNB on the pre-vectorized data.

Using a parameter space of about 400 as before, I got:
Using 8 threads: done in 2102.0565922260284 s and using 25.486000000000054% of the CPU.
Using 16 threads: done in 1385.6887295246124 s and using 49.83674999999993% of the CPU.
Using all 32 threads: done in 1319.416403055191 s and using 89.90074999999997% of the CPU.

Moving from 8 threads to 16 threads shows a huge improvement. However, going from 16 to 32 threads only slightly shortens the overall completion time even though CPU usage increases significantly. I don't quite understand this point.
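A simplified sketch of this experiment (illustrative only; X_tfidf is a placeholder name for the pre-vectorized sparse matrix, and repeating the alpha grid is just a stand-in for my real ~400-point parameter space):

    # Sketch of experiment 2: fit MultinomialNB repeatedly, in parallel, on a
    # pre-vectorized sparse matrix X_tfidf with labels y.
    import numpy as np
    from time import time
    from sklearn.externals.joblib import Parallel, delayed
    from sklearn.naive_bayes import MultinomialNB

    def fit_nb(X, y, alpha):
        return MultinomialNB(alpha=alpha).fit(X, y)

    alphas = np.tile([1, 0.1, 0.01, 0.001, 0.0001, 0.00001], 67)   # ~400 fits in total

    t0 = time()
    models = Parallel(n_jobs=32, verbose=1)(
        delayed(fit_nb)(X_tfidf, y, a) for a in alphas)
    print("done in {0} s".format(time() - t0))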

Experiment 3: I combine the two phases (vectorization + transformation followed by MultinomialNB).

Using 8 threads: done in 3385.3253166675568 s and using 25.68999999999995% of the CPU.
Using 16 threads: done in 2066.499200105667 s and using 49.359249999999996% of the CPU.
Using all 32 threads: done in 2018.8800330162048 s and using 54.55375000000004% of the CPU.

There is some difference between the times I got from my own parallel code and from GridSearchCV, but that could be due to the simplifications I made (I am not doing cross-validation or iterating over the full parameter grid as GridSearchCV does).

Conclusions

From my tests, I conclude the following. (Please correct me if I'm wrong.)

  • The vectorization phase is the more memory-intensive of the two and likely saturates the memory bandwidth. This can be seen from the completion times and CPU usage: it hits some kind of bottleneck and does not scale. However, it is a relatively quick phase. (I rule out I/O as the bottleneck since all data is stored in RAM, and memory usage at that point is around 30%.)
  • MultinomialNB uses memory less intensively than the vectorizer; most of its computation seems to be done in-core. So it scales better than the vectorizer (8 → 16 threads), but after that it also hits some sort of bottleneck. Note that MultinomialNB takes longer overall than the vectorizer.
  • When the two phases are combined, the completion time follows the same trend as MultinomialNB alone. My interpretation is that memory bandwidth may be the bottleneck during the vectorization phase, but that phase is relatively short compared to MultinomialNB. So with a small number of concurrent tasks, the two phases can run side by side without saturating the bandwidth; but once the number of processes is large enough, there are enough concurrent processes doing vectorization to saturate the bandwidth, forcing the operating system to throttle the running processes. (This would explain the 8-9 running python processes I observed earlier.)
  • I'm not entirely sure, but I think the reason SGDClassifier can use 100% of the CPU is that it spends much longer in its core computation than MultinomialNB does. So at each iteration most of the time goes into computing SGDClassifier rather than into vectorization, and because each SGDClassifier fit takes so long, it is less likely that many workers reach the (short but memory-intensive) vectorization stage at the same time.

I guess the best thing to do now is to go for cluster computing. :)



1 answer


It looks like your jobs are memory-bound.

Naive Bayes is an extremely simple model, and its training algorithm consists of a single (sparse) matrix multiplication and a few sums. Similarly, tf-idf is very simple to compute: it sums its inputs, computes a few logs, and stores the result.
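To make that concrete, here is a minimal sketch of the shape of that computation. This is not scikit-learn's actual implementation, just the arithmetic it boils down to: one sparse matrix product plus a few sums and logs.

    # Minimal sketch of the arithmetic behind MultinomialNB training (illustrative,
    # not scikit-learn's code): one sparse matmul, then sums and logs.
    import numpy as np
    import scipy.sparse as sp

    def nb_fit_sketch(X, Y, alpha=1.0):
        # X: (n_samples, n_features) sparse counts; Y: (n_samples, n_classes) one-hot.
        counts = X.T.dot(Y).T + alpha                        # the single sparse matmul
        feature_log_prob = np.log(counts) - np.log(counts.sum(axis=1, keepdims=True))
        class_log_prior = np.log(Y.sum(axis=0) / float(Y.shape[0]))
        return class_log_prior, feature_log_prob

    # Toy usage: 4 documents, 3 vocabulary terms, 2 classes.
    X = sp.csr_matrix([[2., 0., 1.], [1., 1., 0.], [0., 3., 1.], [0., 1., 2.]])
    Y = np.eye(2)[[0, 0, 1, 1]]                              # one-hot encoded labels
    prior, flp = nb_fit_sketch(X, Y)
    print(np.argmax(X.dot(flp.T) + prior, axis=1))           # prediction is another matmul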



In fact, NB is so simple that the bottleneck of this program is almost certainly in CountVectorizer, which transforms in-memory data structures several times until it gets all of its term counts crammed into the right matrix format. You are likely to hit a memory bandwidth bottleneck if you do that in parallel.

(This is all educated guesswork, but it is based on my involvement in scikit-learn development: I am one of the authors of MultinomialNB and one of the many people who have hacked on CountVectorizer to speed it up.)
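If that diagnosis is right, one way to relieve the pressure is to vectorize once up front and run the parallel search only over the cheap downstream steps. The sketch below is illustrative only: it reuses pre_process_1, tokenize_1, heavy_st, dataset and y from the question, and it drops the vec__* parameters from the grid, so it searches a smaller space than the original script.

    # Sketch: precompute the counts once, then grid-search only tf-idf and NB settings,
    # so CountVectorizer is not re-run inside every parallel worker.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.grid_search import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    counts = CountVectorizer(encoding='cp874', preprocessor=pre_process_1,
                             tokenizer=tokenize_1, stop_words=heavy_st,
                             token_pattern=None).fit_transform(dataset)

    sub_pipeline = Pipeline([('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
    sub_parameters = {
        'tfidf__use_idf': (True, False),
        'tfidf__sublinear_tf': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
    }
    grid_search = GridSearchCV(sub_pipeline, sub_parameters, n_jobs=-1, verbose=2)
    grid_search.fit(counts, y)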







