How to handle building a huge sparse matrix with SciPy?

So I'm working on a Wikipedia dump covering about 5,700,000 pages. The files are preprocessed and therefore are not in XML.
They are taken from http://haselgrove.id.au/wikipedia.htm and the format is:

from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
.
.
.
from_page(5,700,000): to(xy) to(xz)

      

and so on. So it is basically the construction of a [5,700,000 x 5,700,000] matrix, which would just break my 4 GB of RAM. Since it is very, very sparse, it is easier to store with scipy.sparse.lil_matrix or scipy.sparse.dok_matrix. Now my problem is:

How can I go about converting the .txt file with the link information to a sparse matrix? Should I read it and compute it as a normal N x N matrix and then convert it, or what? I have no idea.

Also, the links of one page sometimes span several lines, so what would be the right way to handle that?
E.g., a random excerpt looks like this:

[
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
]

      

exactly like that: no commas and no other delimiters.

Any information about sparse matrix structure and line processing would be helpful.



1 answer


SciPy offers several implementations of sparse matrices. Each of them has its own advantages and disadvantages. You can find information on the matrix formats in the scipy.sparse documentation.

There are several ways to get to the desired sparse matrix. Computing the full N x N matrix and then converting it is not feasible because of the memory requirements: 5,700,000 x 5,700,000 is about 3 x 10^13 entries.

In your case, I would prepare your data to build a coo_matrix:

coo_matrix((data, (i, j)), [shape=(M, N)])

data[:] the entries of the matrix, in any order
i[:] the row indices of the matrix entries
j[:] the column indices of the matrix entries
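As a minimal sketch of that constructor, here is a toy 3-page link graph. The page numbers and the choice of 1 as the stored value for every link are assumptions made for illustration, not something the dump prescribes.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy link graph (hypothetical): page 0 links to 1 and 2, page 2 links to 0.
rows = np.array([0, 0, 2], dtype=np.int32)  # i: source page of each link
cols = np.array([1, 2, 0], dtype=np.int32)  # j: target page of each link
data = np.ones(len(rows), dtype=np.int8)    # one stored entry per link

links = coo_matrix((data, (rows, cols)), shape=(3, 3))

print(links.nnz)        # number of stored entries
print(links.toarray())  # dense view, only sensible for tiny matrices
```

Only the three index/value arrays are held in memory, so the cost grows with the number of links rather than with N x N.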

      



You can also take a look at lil_matrix , which can be used to build your matrix step by step.
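A small sketch of the step-by-step approach with lil_matrix might look like the following. The matrix size and the links are made up for the example.

```python
from scipy.sparse import lil_matrix

n_pages = 5  # hypothetical tiny size; the real dump has about 5,700,000 pages
adj = lil_matrix((n_pages, n_pages), dtype="int8")

# Add links one at a time as they are read from the file.
adj[0, 1] = 1
adj[0, 3] = 1
adj[4, 2] = 1

print(adj.nnz)  # number of stored entries so far
```

Element-wise assignment is convenient, but for many millions of links it is likely to be much slower than collecting index arrays first and building a coo_matrix in one go.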

Once you've created the matrix, you can convert it to a more suitable format for calculation, depending on your use case.
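For instance, CSR is a common choice for row slicing and matrix-vector products. The sketch below assumes a toy 3 x 3 matrix; note that converting from COO sums any duplicate (i, j) entries.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy matrix (hypothetical data): three links in a 3-page graph.
coo = coo_matrix((np.ones(3), ([0, 0, 2], [1, 2, 0])), shape=(3, 3))

csr = coo.tocsr()  # compressed sparse row: fast row access and products

# Number of outgoing links per page (row sums).
out_degree = np.asarray(csr.sum(axis=1)).ravel()

# Sparse matrix times dense vector.
v = np.ones(3)
product = csr.dot(v)
```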

I don't recognize the data format; there might be existing parsers for it, maybe not. However, writing your own parser shouldn't be very difficult. Each line containing a colon starts a new row. All indices after the colon, and on the following lines without a colon, are the column entries for that row.
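The parsing rule just described can be sketched as follows. The function name `parse_links`, the sample text, and the decision to keep the file's page numbers as indices (leaving row and column 0 unused) are assumptions for illustration. For the full dump, plain Python lists of integers may use too much memory, so more compact buffers or chunked processing would probably be needed.

```python
import io
import numpy as np
from scipy.sparse import coo_matrix

def parse_links(lines):
    """Return (rows, cols) index lists from 'from: to to ...' lines.

    A line with a colon starts a new source page; lines without a
    colon continue the target list of the current source page.
    """
    rows, cols = [], []
    current = None
    for line in lines:
        head, sep, tail = line.partition(":")
        if sep:
            current = int(head)      # new source page
            targets = tail.split()
        else:
            targets = line.split()   # continuation of the previous page
        if current is None:
            continue                 # targets before any source page: skip
        for target in targets:
            rows.append(current)
            cols.append(int(target))
    return rows, cols

# Hypothetical sample in the same shape as the excerpt in the question.
sample = io.StringIO("1: 2 3 5\n2:3 4\n3: 1 2\n4 5\n4:1\n")
rows, cols = parse_links(sample)

# Assumption: page numbers are used directly as indices, so index 0 is unused.
n = max(max(rows), max(cols)) + 1
data = np.ones(len(rows), dtype=np.int8)
links = coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()
```

With a real file, `parse_links(open(path))` would stream the file line by line instead of loading it at once.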
