Memory-efficient FDR correction with Benjamini-Hochberg using numpy / h5py
I am trying to compute a set of FDR-corrected p-values using Benjamini and Hochberg's method. However, the vector of p-values I need to correct contains over 10 billion values.
Given the amount of data, the usual function from statsmodels' multicomp module runs out of memory. Looking at its source code, it creates several intermediate vectors of the same length as the input, which for 10 billion values obviously won't work even on a machine with 100 GB of RAM.
Is there a way to do this without having to hold the entire vector in memory? In particular, I'm wondering whether BH can be reimplemented so that it runs out-of-core on h5py data structures stored on disk.
Or any other suggestions?
In case someone else stumbles upon this:
The way I solved this was to first extract all p-values that had any chance of passing the FDR threshold (I used a cutoff of 1e-5). Memory consumption was not an issue here, since I could simply iterate over the on-disk list of p-values in chunks.
This gave me a set of the smallest p-values, around 400k of them. I then applied the BH procedure to those p-values manually, but plugged the original number of tests into the formula. Since BH is a step-up procedure that only depends on each p-value's rank and the total number of tests, this is (as far as I know) equivalent to applying BH to the entire vector, without requiring me to sort 10 billion values.
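A minimal sketch of that manual BH step, with hypothetical names; the key point is that `m_total` is the original number of tests, not the size of the filtered subset:

```python
import numpy as np

def bh_cutoff(p_sub, m_total, alpha=0.05):
    """Benjamini-Hochberg on a pre-filtered subset of p-values.

    p_sub   -- the p-values that survived the pre-filter
    m_total -- the ORIGINAL total number of tests (e.g. 10 billion)

    Compares the i-th smallest p-value against alpha * i / m_total and
    returns the largest p-value that is still rejected (0.0 if none).
    This matches full-vector BH as long as the pre-filter cutoff is
    below the BH critical value at the rejection boundary.
    """
    p_sorted = np.sort(p_sub)
    ranks = np.arange(1, len(p_sorted) + 1)
    crit = alpha * ranks / m_total       # BH critical values
    below = p_sorted <= crit
    if not below.any():
        return 0.0
    k_max = np.nonzero(below)[0].max()   # step-up: largest rank passing
    return p_sorted[k_max]
```

Everything at or below the returned cutoff is rejected, so the final pass over the full on-disk vector only needs a single threshold comparison per chunk.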