How to handle building a huge sparse matrix with SciPy?
So I'm working on a Wikipedia dump to run computations over about 5,700,000 pages. The files are preprocessed, so they are not in XML.
They are taken from http://haselgrove.id.au/wikipedia.htm
and the format is:
from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
.
.
.
from_page(5,700,000): to(xy) to(xz)
and so on. So it's basically a matrix of size [5,700,000 x 5,700,000], which would blow well past my 4 GB of RAM. Since it is very, very sparse, it should be easier to store using scipy.sparse.lil_matrix or scipy.sparse.dok_matrix. Now my problem is:
How can I go about converting the .txt file with the link information into a sparse matrix? Should I read it and build a normal N*N matrix and then convert it, or what? I have no idea.
Also, the links of one page sometimes span multiple lines, so what would be the right way to handle that?
e.g. a random excerpt looks like this:
[
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
]
exactly like that: no commas and no other delimiters.
Any information about sparse matrix construction and about handling rows that span several lines would be helpful.
SciPy offers several sparse matrix implementations, each with its own advantages and disadvantages. The scipy.sparse documentation describes the available matrix formats.
There are several ways to get to the desired sparse matrix. Computing the full NxN matrix and then converting it is not feasible because of the memory requirements (5,700,000^2 is about 3 * 10^13 entries!).
In your case, I would prepare your data to build a coo_matrix.
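To make the memory argument concrete, here is a rough back-of-the-envelope calculation. The number of links used for the sparse estimate is a made-up placeholder (the question does not state it), and the byte sizes assume float64 for the dense case and 32-bit indices plus 1-byte values for the coordinate format.

```python
n_pages = 5_700_000

# Dense N x N matrix: one float64 (8 bytes) per entry.
dense_entries = n_pages * n_pages          # 32,490,000,000,000 entries
dense_bytes = dense_entries * 8
dense_tib = dense_bytes / 2**40            # size in TiB


def coo_bytes(n_links, index_bytes=4, value_bytes=1):
    """Approximate size of a coordinate-format matrix: a row index,
    a column index and a value for every stored link."""
    return n_links * (2 * index_bytes + value_bytes)


# Hypothetical link count, chosen only to illustrate the order of magnitude.
example_links = 100_000_000
sparse_gib = coo_bytes(example_links) / 2**30

print(f"dense:  {dense_tib:.0f} TiB")
print(f"sparse: {sparse_gib:.2f} GiB for {example_links:,} links")
```

The dense matrix is out of reach by several orders of magnitude, while the sparse cost grows only with the number of links actually present.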
coo_matrix((data, (i, j)), [shape=(M, N)])
data[:] the entries of the matrix, in any order
i[:] the row indices of the matrix entries
j[:] the column indices of the matrix entries
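As a small illustration of that constructor, the sketch below builds a coo_matrix from a handful of links taken from the first two rows of the sample in the question. The matrix size of 1000 is an arbitrary value for the demo, and subtracting 1 assumes the page ids in the file are 1-based.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A few (from, to) pairs from the question's sample: "1: 2 3 5 ..." and "2:355 776 ..."
from_pages = np.array([1, 1, 1, 2, 2], dtype=np.int32)
to_pages = np.array([2, 3, 5, 355, 776], dtype=np.int32)

# Every link is stored as a 1; int8 keeps the value array small.
data = np.ones(len(from_pages), dtype=np.int8)

n = 1000  # arbitrary size for this demo, not the real page count

# Shift to 0-based indices (assumes page ids start at 1).
m = coo_matrix((data, (from_pages - 1, to_pages - 1)), shape=(n, n))

print(m.shape, m.nnz)
```

Only the five stored links take up memory here, regardless of how large n is.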
You can also take a look at lil_matrix , which can be used to build your matrix step by step.
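A minimal sketch of the step-by-step approach with lil_matrix might look like this. The links dictionary and the size of 10 are invented for the demo; indices are 0-based here.

```python
import numpy as np
from scipy.sparse import lil_matrix

n = 10  # small demo size
m = lil_matrix((n, n), dtype=np.int8)

# Hypothetical links: source page -> list of target pages (0-based).
links = {0: [1, 2, 4], 1: [3]}

# Entries can be assigned one at a time as the input is read.
for src, targets in links.items():
    for dst in targets:
        m[src, dst] = 1

print(m.nnz)
```

This is convenient when entries arrive one by one, but for many millions of links, collecting index arrays and building a coo_matrix in one go is likely to be faster and lighter on memory.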
Once you've created the matrix, you can convert it to a more suitable format for calculation, depending on your use case.
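For example, CSR is a common choice for row-wise operations and matrix-vector products, and CSC for column-wise access. The sketch below converts a tiny made-up link matrix and counts outgoing and incoming links per page; the data is invented for the demo.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Tiny made-up link matrix: entry (i, j) == 1 means page i links to page j.
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([1, 2, 2, 0, 1, 3])
data = np.ones(len(rows), dtype=np.int32)
coo = coo_matrix((data, (rows, cols)), shape=(4, 4))

csr = coo.tocsr()  # efficient row slicing and matrix-vector products
csc = coo.tocsc()  # efficient column slicing

# Sums over a sparse matrix return a 2-D result; flatten it to a 1-D array.
out_degree = np.asarray(csr.sum(axis=1)).ravel()  # links leaving each page
in_degree = np.asarray(csc.sum(axis=0)).ravel()   # links arriving at each page

print(out_degree, in_degree)
```

Note that duplicate (i, j) pairs in a coo_matrix are summed during conversion, so a link listed twice in the input would end up as a 2 rather than a 1.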
I don't recognize the data format; there may or may not be existing parsers for it. However, writing your own parser shouldn't be very difficult. Each line containing a colon starts a new row; all indices after the colon, and on the following lines that contain no colon, are the column entries for that row.
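A minimal sketch of such a parser, under a few assumptions: the function name parse_links is made up, the sample text is a shortened stand-in for the real file (same layout, including a row that continues on the next line), the surrounding [ and ] from the question are not part of the file, and page ids are 1-based.

```python
from array import array
import io

import numpy as np
from scipy.sparse import coo_matrix


def parse_links(lines):
    """Return (rows, cols) arrays of page ids, one pair per link."""
    # array("q") stores 64-bit ints compactly, unlike a list of Python ints.
    rows, cols = array("q"), array("q")
    current = None  # source page of the row currently being read
    for line in lines:
        head, sep, rest = line.partition(":")
        if sep:
            # "from: to to ..." starts a new row.
            current = int(head)
            targets = rest
        else:
            # No colon: the line continues the previous row.
            targets = line
        for token in targets.split():
            if current is None:
                raise ValueError("link targets found before any row header")
            rows.append(current)
            cols.append(int(token))
    return np.array(rows, dtype=np.int64), np.array(cols, dtype=np.int64)


# Stand-in for the real file; row 3 continues on the following line.
sample = io.StringIO("1: 2 3 5\n2:4 1\n3: 1 2\n4 5\n4:1\n")
rows, cols = parse_links(sample)

n_pages = 5  # demo size
data = np.ones(len(rows), dtype=np.int8)
# Shift the 1-based page ids to 0-based matrix indices.
links = coo_matrix((data, (rows - 1, cols - 1)), shape=(n_pages, n_pages))

print(links.nnz)
```

For the real data, an open file object can be passed to parse_links instead of the StringIO, since the function only iterates over lines and never holds the whole file in memory.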