How to efficiently compute a huge matrix multiplication (tf-idf features) in Python?

I currently want to calculate all-pairs document similarity using cosine similarity and TF-IDF features in Python. My basic approach is this:

from sklearn.feature_extraction.text import TfidfVectorizer
#c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)   # sparse tf-idf matrix, one row per document
del vec
Y = X * X.T                # pairwise dot products (cosine similarities, since rows are l2-normalized)

      

This works great, but unfortunately not for my very large datasets. X has dimensions (350363, 2526183), and therefore the output matrix Y must be (350363, 350363). X is very sparse due to the nature of tf-idf and therefore fits easily into memory (only about 2 GB). However, the multiplication gives me a memory error after running for some time (even though memory is not actually full at that point, so I suppose scipy is smart enough to anticipate the memory it will need).

I have already tried playing with the dtypes, without success. I also made sure that numpy and scipy are linked against their BLAS libraries, though this does not affect csr_matrix multiplication, since that is implemented in C. I thought about using something like memmap, but I'm not sure if ...

Does anyone have an idea of how best to approach this?

+3




3 answers


You can have a look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eps, which is a hyperparameter when you compute the number of random projections needed.

To cut a long story short, the scikit-learn class SparseRandomProjection is a transformer that does this for you. If you run it on X after vec.fit_transform, you should see a rather significant reduction in the number of features.

The formula from sklearn.random_projection.johnson_lindenstrauss_min_dim shows that you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features to preserve pairwise distances up to a 10% tolerance. This is an upper bound, so you may get away with far fewer. Even a 1% tolerance would only require johnson_lindenstrauss_min_dim(350363, .01) = 1,028,192 features, which is still significantly less than what you currently have.
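
A minimal sketch of how that might look (eps=0.1, random_state and the variable names are illustrative choices, not part of the answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection, johnson_lindenstrauss_min_dim

# c = [doc1, doc2, ..., docn] as in the question
vec = TfidfVectorizer()
X = vec.fit_transform(c)

# number of dimensions needed for a 10% distortion tolerance (the upper bound above)
n_comp = johnson_lindenstrauss_min_dim(X.shape[0], eps=0.1)

srp = SparseRandomProjection(n_components=n_comp, random_state=0)
X_small = srp.fit_transform(X)   # (350363, n_comp) instead of (350363, 2526183);
                                 # pairwise distances on X_small approximate those on X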



EDIT: Another simple thing to try - if your data is dtype='float64', try using 'float32'. That alone can save a lot of space, especially if you do not need double precision.

If the problem is that you also cannot store the "final matrix" in memory, then I would recommend working with the data in an HDFStore (as provided by pandas on top of PyTables). The pandas documentation has good starter code, and you can iteratively calculate chunks of your dot product and write them to disk. I used this extensively on a recent project with a 45 GB dataset, and it could be of great help if you decide to go this route.
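
A rough sketch of that chunked, write-to-disk idea; h5py is used here instead of pandas' HDFStore just to keep the example short, and the file name, chunk size and dtype are arbitrary choices:

import h5py

n = X.shape[0]                 # X is the sparse tf-idf matrix from the question
chunk = 1000                   # rows of the product to materialize at a time

with h5py.File('similarities.h5', 'w') as f:
    Y = f.create_dataset('Y', shape=(n, n), dtype='float32')
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # only this (chunk, n) slab of the product ever lives in RAM
        Y[start:stop, :] = (X[start:stop] * X.T).toarray()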

+4




Even though X is sparse, X * X.T probably won't be: note that a pair of rows only needs a single non-zero element in common to produce a non-zero entry. You are working on an NLP problem, so there is surely a huge number of words that appear in almost all documents (and as said before, it does not have to be one word shared by every pair, just one (possibly different) word per pair). You end up with a matrix of 350363^2, i.e. about 122,000,000,000 elements; if you don't have 200 GB of RAM, it does not look computable. Try much more aggressive word filtering to keep X * X.T sparse (remove lots of common words), as sketched below.
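
As an illustration of that filtering, TfidfVectorizer can drop very common and very rare words by itself via max_df and min_df (the thresholds below are placeholders to tune, not recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer

# ignore words appearing in more than 20% of documents or in fewer than 5 documents
vec = TfidfVectorizer(max_df=0.2, min_df=5, stop_words='english')
X = vec.fit_transform(c)       # fewer shared "common" words -> sparser X * X.T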



In general, you cannot compute the Gram matrix of big data unless you enforce sparsity of X * X.T, i.e. make most of your vector (document) pairs have 0 "affinity". This can be done in different ways; the easiest is to set a threshold T below which you treat <a,b> as 0, compute the dot products yourself, and create an entry in the resulting sparse matrix iff <a,b> > T.
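
One possible sketch of that thresholding, done block by block so the dense intermediate stays small (the threshold T and the block size are placeholder values):

from scipy import sparse

T = 0.3                       # similarity threshold, placeholder value
chunk = 1000
blocks = []

for start in range(0, X.shape[0], chunk):
    stop = min(start + chunk, X.shape[0])
    block = (X[start:stop] * X.T).toarray()   # dense slab of dot products
    block[block <= T] = 0                     # treat weak affinities as 0
    blocks.append(sparse.csr_matrix(block))   # keep only entries with <a,b> > T

Y = sparse.vstack(blocks, format='csr')       # sparse, thresholded Gram matrix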

+6




What you can do is slice one row out of X, multiply it by X.T, and save the resulting row of the product to a file, then move on to the next row.

It's still the same amount of computation, but you would not run out of memory.

By using multiprocessing.Pool.map() or multiprocessing.Pool.map_async(), you can speed this up considerably, provided you use numpy.memmap() to read the matrix inside the mapped function. You will probably have to write each computed row to a separate file and concatenate them afterwards. If you return the row from the mapped function, it has to be transferred back to the original process, which requires a lot of memory and IPC bandwidth.
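
A rough sketch of that scheme; here the workers reload the sparse matrix from an .npz file (rather than numpy.memmap, which does not apply directly to a csr_matrix), and the block size, file names, and worker count are all arbitrary choices:

import numpy as np
from multiprocessing import Pool
from scipy import sparse

BLOCK = 1000                                    # rows of the product per task

def init_worker(path):
    # each worker loads the tf-idf matrix once instead of receiving it per task
    global X
    X = sparse.load_npz(path)

def compute_block(start):
    stop = min(start + BLOCK, X.shape[0])
    rows = (X[start:stop] * X.T).toarray()      # dense slab of the product
    np.save('rows_%d.npy' % start, rows)        # one file per block, concatenated later
    return start

if __name__ == '__main__':
    # assumes the question's X was saved beforehand via sparse.save_npz('X.npz', X)
    n = sparse.load_npz('X.npz').shape[0]
    with Pool(processes=4, initializer=init_worker, initargs=('X.npz',)) as pool:
        pool.map(compute_block, range(0, n, BLOCK))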

+1


source






