CountVectorizer() in scikit-learn (Python) gives a MemoryError when fed a large dataset. The same code works fine with a smaller dataset; what am I missing?

I am working on a binary classification problem. The training set contains 2 million URL strings with labels 0 and 1. A LogisticRegression() classifier must predict one of the two labels for the test data. I get about 95% precision when I use a smaller dataset of 78,000 URLs with the same 0/1 labels.

The problem is that when I load the large dataset (2 million URL strings) I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
    execfile(filename, namespace)
  File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
    bi_counts = bi.fit_transform(url_list)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
    j_indices.append(vocabulary[feature])
MemoryError


My code that works for small datasets with high enough precision is

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Character 3-grams (word-boundary aware), binary counts, vocabulary capped at 9000 features
bi = CountVectorizer(ngram_range=(3, 3), binary=True, max_features=9000, analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
# use_idf is a constructor argument of TfidfTransformer, not of fit_transform
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1', intercept_scaling=0.5, random_state=0)
clf.fit(X_train_tf, y)
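
For reference, here is a minimal sketch of how the test data is scored with the same fitted pipeline; test_url_list and y_test are placeholder names for the held-out URLs and their labels:

# Reuse the fitted vectorizer/transformer on the held-out URLs (placeholder names),
# then predict with the trained classifier and check precision.
test_counts = bi.transform(test_url_list)
X_test_tf = tf.transform(test_counts)
predictions = clf.predict(X_test_tf)

from sklearn.metrics import precision_score
print(precision_score(y_test, predictions))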


I tried making max_features as small as possible, e.g. max_features=100, but I still get the same error.

Note:

  • I am using an Intel Core i5 machine with 4 GB of RAM
  • I have tried the same code on a machine with 8 GB of RAM, but no luck
  • I am using Python 2.7.6 with scikit-learn, NumPy 1.8.1, SciPy 0.14.0 and Matplotlib 1.3.1

UPDATE:

@Andreas Müller suggested using HashingVectorizer(). I tried it with both the small and the large dataset: the 78,000-URL set was vectorized successfully, but the 2 million-URL set gave me the same memory error as shown above. I also tried it on the 8 GB machine, where memory usage was only around 30% while processing the large dataset.



1 answer


IIRC, max_features is only applied after the entire vocabulary has been computed. The easiest way out is to use HashingVectorizer, which does not compute a vocabulary at all. You lose the ability to map a feature back to its token, but you should no longer run into memory issues.
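
A minimal sketch of the swap, assuming the same char_wb 3-gram setup from the question (n_features=2 ** 18 is just an illustrative choice, and non_negative keeps the hashed values non-negative for the TfidfTransformer step):

from sklearn.feature_extraction.text import HashingVectorizer

# Stateless hashing: no vocabulary is built, so memory stays bounded by n_features.
hv = HashingVectorizer(ngram_range=(3, 3), analyzer='char_wb', binary=True,
                       n_features=2 ** 18, non_negative=True)
bi_counts = hv.transform(url_list)  # drop-in replacement for the CountVectorizer step

The rest of the pipeline (TfidfTransformer and LogisticRegression) can stay as it is; if 2 million URLs still run out of memory, lowering n_features is the first knob to turn.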


