Memory-efficient FDR correction with Benjamini-Hochberg using numpy / h5py

I am trying to compute a set of FDR-corrected p-values using Benjamini and Hochberg's method. However, the vector of p-values I'm working with contains over 10 billion values.

Given the amount of data, the usual multipletests function from statsmodels' multitest module runs out of memory. Looking at the source code for this function, it seems that it creates several vectors of length 10 billion in memory, which obviously won't work even on a machine with 100 GB of RAM.
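For reference, this is roughly the straightforward in-memory call I had in mind (the file name is just an example, and the p-values are assumed to fit in a single numpy array, which they don't here):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# ~10 billion float64 p-values: the array alone is ~80 GB,
# and multipletests allocates several more arrays of the same length.
pvals = np.load("pvals.npy")  # hypothetical file name
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```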

Is there a way to do this, ideally without having to store the entire vector in memory? In particular, I'm wondering if it's possible to re-implement BH in such a way that it can run on disk using h5py data structures.

Or any other suggestions?



1 answer


In case someone else stumbles upon this:

The way I solved this was to first extract all p-values that had a chance of passing the FDR correction threshold (I used 1e-5). Memory consumption was not an issue for this step, since I could just iterate over the list of p-values on disk.
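Roughly, the filtering step looked like this. This is a minimal sketch, assuming the p-values live in a single HDF5 dataset; the file name, dataset name, and chunk size are placeholders:

```python
import numpy as np
import h5py

THRESHOLD = 1e-5          # only p-values below this can possibly survive BH
CHUNK = 10_000_000        # read ~10M values (~80 MB of float64) at a time

kept_pvals = []
with h5py.File("pvals.h5", "r") as f:          # hypothetical file / dataset names
    dset = f["pvals"]
    n_tests = dset.shape[0]                    # original number of tests (~10 billion)
    for start in range(0, n_tests, CHUNK):
        chunk = dset[start:start + CHUNK]      # stream one chunk into memory
        kept_pvals.append(chunk[chunk < THRESHOLD])

kept_pvals = np.concatenate(kept_pvals)        # ~400k values in my case
```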



This gave me a set of the ~400k smallest p-values. I then manually applied the BH procedure to those p-values, but plugged the original number of tests into the formula. Since BH is a step-up procedure, this is (as far as I know) equivalent to applying BH to the entire vector, without requiring me to sort 10 billion values.
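And the manual BH step on the retained values, again just a sketch with hypothetical names, using the original number of tests as m:

```python
import numpy as np

def bh_on_subset(kept_pvals, n_total_tests, alpha=0.05):
    """Benjamini-Hochberg applied to a pre-filtered subset of the smallest p-values.

    This is valid as long as every p-value that could possibly be rejected was kept,
    because BH only compares the i-th smallest p-value against (i / m) * alpha.
    """
    p_sorted = np.sort(kept_pvals)
    m = n_total_tests                            # use the ORIGINAL number of tests
    ranks = np.arange(1, len(p_sorted) + 1)

    # Step-up rule: find the largest i with p_(i) <= (i / m) * alpha
    passed = p_sorted <= ranks / m * alpha
    if not passed.any():
        return 0.0, 0                            # nothing rejected
    cutoff_idx = np.where(passed)[0][-1]         # index of largest passing p-value
    return p_sorted[cutoff_idx], cutoff_idx + 1  # rejection threshold, #rejections

# e.g. threshold, n_rejected = bh_on_subset(kept_pvals, n_tests, alpha=0.05)
```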







