CountVectorizer() in scikit-learn gives a MemoryError when fed a large dataset. The same code works fine with a smaller dataset, so what am I missing?
I am working on a machine learning problem with two classes. The training set contains 2 million URL strings with labels 0 and 1. A LogisticRegression() classifier must predict one of the two labels for the test data. I get 95% precision when I use a smaller dataset of 78,000 URLs with the same 0/1 labels.
The problem is that when I load the large dataset (2 million URL strings), I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
bi_counts = bi.fit_transform(url_list)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
j_indices.append(vocabulary[feature])
MemoryError
My code, which works on the small dataset with high enough precision, is:
bi = CountVectorizer(ngram_range=(3, 3), binary=True, max_features=9000,
                     analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1', intercept_scaling=0.5,
                         random_state=True)
clf.fit(X_train_tf, y)
I tried making "max_features" as small as possible, down to max_features=100, but I still get the same error.
Note:
- I am using a Core i5 machine with 4GB of RAM
- I have tried the same code on a machine with 8GB of RAM, but no luck
- I am using Python 2.7.6 with scikit-learn, NumPy 1.8.1, SciPy 0.14.0, and Matplotlib 1.3.1
UPDATE:
@Andreas Müller suggested using HashingVectorizer(). I tried it with both the small and large datasets: the 78,000-URL dataset ran successfully, but the 2 million-URL dataset gave me the same memory error shown above. I also tried it on the 8GB machine, with about 30% of memory in use while fitting the large dataset.
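For reference, here is a minimal sketch of swapping HashingVectorizer into this pipeline; the n_features value is illustrative, not from the original code:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Tiny stand-in for url_list; the real data is 2 million URL strings.
url_list = ["http://example.com/login", "http://test.org/index.html"]

# Same character-trigram features as the CountVectorizer setup, but
# hashed into a fixed number of columns, so no vocabulary is stored.
hv = HashingVectorizer(analyzer='char_wb', ngram_range=(3, 3),
                       n_features=2 ** 18, binary=True, norm='l2')
X = hv.transform(url_list)  # no fit needed: the hash is stateless
print(X.shape)  # (2, 262144)
```

Because the hash function replaces the vocabulary dict, memory use no longer grows with the number of distinct n-grams, only with the number of nonzero entries per document.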
IIRC, max_features is only applied after the entire vocabulary has been computed. The easiest way out is to use HashingVectorizer, which does not compute a vocabulary. You will lose the ability to map features back to their tokens, but you should no longer run into memory issues.