How to efficiently compute a huge matrix multiplication (tf-idf features) in Python?
I currently want to compute all-pair document similarity using cosine similarity and tf-idf features in Python. My basic approach is this:
from sklearn.feature_extraction.text import TfidfVectorizer
#c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)
del vec
Y = X * X.T
This works great, but unfortunately not for my very large datasets. X has dimension (350363, 2526183), and therefore the output matrix Y will have shape (350363, 350363). X is very sparse due to the nature of tf-idf and therefore fits easily into memory (only about 2 GB). However, the multiplication gives me a memory error after running for some time (even though memory is not actually full, so I suppose scipy is smart enough to anticipate the memory it will need).
I have already tried playing around with the dtypes without any success. I have also made sure that numpy and scipy are linked against their BLAS libraries, although this does not affect the csr_matrix dot product, since that is implemented in C. I thought about using things like memmap, but I'm not sure about ...
Does anyone have an idea of how best to approach this?
You can have a look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta (the eps parameter in scikit-learn), which is a hyperparameter you use when computing the number of random projections needed.
To cut a long story short, the scikit-learn class SparseRandomProjection is a transformer that does this for you. If you run it on X after vec.fit_transform, you should see a fairly significant reduction in the number of features.
The formula sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve distances up to a 10% tolerance you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features. This is an upper bound, so you may get away with far fewer. Even a 1% tolerance would only require johnson_lindenstrauss_min_dim(350363, .01) = 1028192 features, which is still significantly less than what you currently have.
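A minimal sketch of that pipeline (assuming c is the list of documents from the question; the 10% tolerance is passed as eps, and the random seed is arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection, johnson_lindenstrauss_min_dim

vec = TfidfVectorizer()
X = vec.fit_transform(c)                    # sparse, shape (350363, 2526183)

# number of random projections needed to keep pairwise distances within 10%
n_components = johnson_lindenstrauss_min_dim(X.shape[0], eps=0.1)   # about 10942

srp = SparseRandomProjection(n_components=n_components, random_state=42)
X_small = srp.fit_transform(X)              # still sparse, roughly (350363, 10942)

X_small can then be used in place of X for the pairwise product; the (350363, 350363) output is still large, which is what the HDF5 suggestion below is for.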
EDIT: Something simple to try - if your data is dtype='float64', try using 'float32'. That alone can save a lot of memory, especially if you don't need double precision.
If the problem is that you also cannot store the final matrix in memory, I would recommend working with the data in an HDF5Store (as shown for pandas using pytables). The pandas docs have good starter code, and you can iteratively compute chunks of your dot product and write them to disk. I used this extensively on a recent project with a 45 GB dataset, and it could be of great help if you decide to go this route.
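A rough sketch of that chunked computation; it uses h5py rather than pandas' HDFStore, since a plain HDF5 dataset maps directly onto row blocks of the dense output (the file name, block size and float32 cast are all arbitrary choices):

import h5py

def gram_to_hdf5(X, out_path="gram.h5", block=1000):
    # write X * X.T to disk block by block; X is the sparse tf-idf matrix
    n = X.shape[0]
    with h5py.File(out_path, "w") as f:
        Y = f.create_dataset("Y", shape=(n, n), dtype="float32")
        for start in range(0, n, block):
            stop = min(start + block, n)
            # dense (block, n) slice of the Gram matrix, cast down to float32
            rows = (X[start:stop] @ X.T).toarray().astype("float32")
            Y[start:stop, :] = rows

Keep in mind that the full (350363, 350363) float32 matrix is still on the order of half a terabyte on disk, so the sparsification suggested below may be needed anyway.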
Even though X is sparse, X * X.T probably won't be: note that it only takes one common non-zero element in a given pair of rows. You are working on an NLP problem, so there is almost certainly a huge number of words that occur in nearly all documents (and, as said, it does not have to be one word for all pairs, just one (possibly different) word for each pair). As a result you end up with a matrix of 350363^2, about 122,000,000,000 elements; if you don't have on the order of 200 GB of RAM, it does not look computable. Try much more aggressive filtering of the words in order to force X * X.T to be sparse (remove many common words).
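For example, with TfidfVectorizer that filtering is just a matter of its max_df / min_df / stop_words parameters (the cutoffs below are illustrative and would need tuning for your corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

# drop words that occur in more than half of the documents or in fewer than 5,
# plus English stop words; this is what gives X * X.T a chance of staying sparse
vec = TfidfVectorizer(max_df=0.5, min_df=5, stop_words="english")
X = vec.fit_transform(c)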
In general, you cannot compute the Gram matrix of big data unless you enforce sparsity of X * X.T, so that most of your vector (document) pairs have 0 "similarity". This can be done in numerous ways; the easiest is to set a threshold T below which you treat <a,b> as 0, compute the dot products yourself, and create an entry in the resulting sparse matrix only if <a,b> > T.
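A minimal sketch of that thresholding idea, computed block by block so the full dense Gram matrix is never materialised (T and the block size are illustrative values):

import scipy.sparse as sp

def thresholded_similarity(X, T=0.2, block=1000):
    # X * X.T with every entry <= T dropped, returned as one sparse matrix
    n = X.shape[0]
    blocks = []
    for start in range(0, n, block):
        stop = min(start + block, n)
        rows = X[start:stop] @ X.T            # sparse (block, n) chunk of the Gram matrix
        rows.data[rows.data <= T] = 0.0       # treat weak similarities as 0
        rows.eliminate_zeros()
        blocks.append(rows)
    return sp.vstack(blocks).tocsr()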
What you can do is slice out a block of rows X[i:j] and a block X[k:l], multiply X[i:j] by X[k:l].T to get one block of the output, and save that block to a file. Then move on to the next pair of blocks.
It's still the same amount of computation, but you would not run out of memory.
Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async(), you can likely speed this up considerably, provided you use numpy.memmap() to read the matrix inside the mapped function. You would probably have to write each computed row to a separate file and concatenate them afterwards. If you instead return the row from the mapped function, it has to be transferred back to the original process, which takes a lot of memory and IPC bandwidth.
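A sketch of that parallel scheme. Since X is sparse, numpy.memmap does not apply to it directly, so this version instead has every worker load the matrix once from an .npz file in the pool initializer and write its block of rows to its own .npy file; the names, paths and sizes are illustrative:

import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool

_X = None  # one copy of the tf-idf matrix per worker process

def _init(npz_path):
    global _X
    _X = sp.load_npz(npz_path)                # each worker loads X once

def _compute_block(args):
    start, stop, out_dir = args
    rows = (_X[start:stop] @ _X.T).toarray().astype("float32")
    path = f"{out_dir}/rows_{start}_{stop}.npy"
    np.save(path, rows)                       # write this block of Y to its own file
    return path                               # return only the file name, not the data

def pairwise_by_blocks(npz_path, n_docs, out_dir=".", block=1000, workers=4):
    tasks = [(s, min(s + block, n_docs), out_dir) for s in range(0, n_docs, block)]
    with Pool(processes=workers, initializer=_init, initargs=(npz_path,)) as pool:
        return pool.map(_compute_block, tasks)

The blocks can then be stitched together (or simply looked up) by file name, so nothing large ever has to travel back over IPC.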