Memory-efficient FDR correction with Benjamini-Hochberg using numpy / h5py
I am trying to compute a set of FDR-corrected p-values using Benjamini and Hochberg's method. However, the vector of p-values I need to correct contains over 10 billion values.
Given the amount of data, the usual function from statsmodels' multicomp module runs out of memory. Looking at its source code, it creates several intermediate vectors of the same length as the input, which for 10 billion values obviously won't work even on a machine with 100 GB of RAM.
Is there a way to do this without having to hold the entire vector in memory? In particular, I'm wondering whether BH can be reimplemented so that it runs out-of-core on h5py data structures stored on disk.
Or any other suggestions?
In case someone else stumbles upon this:
The way I solved this was to first extract all p-values that had any chance of passing the FDR threshold (I used a cutoff of 1e-5). Memory consumption was not an issue here, since I could simply iterate over the on-disk list of p-values in chunks.
This gave me a set of the smallest p-values, around 400k of them. I then applied the BH procedure to those p-values manually, but plugged the original number of tests into the formula. Since BH is a step-up procedure that only depends on each p-value's rank and the total number of tests, this is (as far as I know) equivalent to applying BH to the entire vector, without requiring me to sort 10 billion values.
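A minimal sketch of that manual BH step, with hypothetical names; the key point is that `m_total` is the original number of tests, not the size of the filtered subset:

```python
import numpy as np

def bh_cutoff(p_sub, m_total, alpha=0.05):
    """Benjamini-Hochberg on a pre-filtered subset of p-values.

    p_sub   -- the p-values that survived the pre-filter
    m_total -- the ORIGINAL total number of tests (e.g. 10 billion)

    Compares the i-th smallest p-value against alpha * i / m_total and
    returns the largest p-value that is still rejected (0.0 if none).
    This matches full-vector BH as long as the pre-filter cutoff is
    below the BH critical value at the rejection boundary.
    """
    p_sorted = np.sort(p_sub)
    ranks = np.arange(1, len(p_sorted) + 1)
    crit = alpha * ranks / m_total       # BH critical values
    below = p_sorted <= crit
    if not below.any():
        return 0.0
    k_max = np.nonzero(below)[0].max()   # step-up: largest rank passing
    return p_sorted[k_max]
```

Everything at or below the returned cutoff is rejected, so the final pass over the full on-disk vector only needs a single threshold comparison per chunk.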