Sampling histograms so that the sum over the sample is uniform

I have a list of items from which I want to randomly select a subset, but each item is paired with a histogram over D bins, and I want to sample the items in such a way that the summed histogram is approximately uniform.

So it should work like the example function below:

>>> import numpy
>>> #The histograms from which to sample (each having 5 bins):
>>> data = numpy.random.randint(100, size=(10000,5))
>>> #The function which I'm trying to program:
>>> samples = sample(data,500)
>>> samples.shape
(500, 5)
>>> summed_histogram = samples.sum(axis=0)
>>> #Each bin should have approximately equal value
>>> summed_histogram / float(summed_histogram.sum())
array([ 0.2,  0.2,  0.2,  0.2,  0.2])


The absolute values of the summed histogram are not important, and it does not need to be perfectly uniform, just approximately so. Also, I don't care if the returned sample size is not exactly the specified sample size. The sampling should be without replacement.



2 answers


To expand on @Ilmari Karonen's suggestion, what you want to do is compute a weight for each histogram and then sample according to those weights. It seems to me that the most efficient way to do this, given your purpose, is with a linear program.

Let D_ij be the value of the j-th bin in the histogram of the i-th item. Then, if each item is given a weight w_i, the j-th bin of the summed histogram will have the weighted sum (sum over i in items) w_i D_ij. One way to get your "approximately uniform" distribution is to minimize the maximum difference between bins, so we would solve the following LP:

minimize z
subject to (for all j, k) 
    z >= (sum i in items) w_i D_ij - (sum i in items) w_i D_ik
    z >= (sum i in items) w_i D_ik - (sum i in items) w_i D_ij


The above basically says that z must be at least the absolute value of the difference between every pair of weighted bin sums. You will need a separate package to solve this LP, since numpy does not include an LP solver: see this answer for a solution using cplex, or this gist for a solution using cvxpy. Note that you will need to put some bounds on the weights (for example, each weight greater than or equal to 0), as those solutions do. Other Python bindings for GLPK (GNU Linear Programming Kit) can be found here: http://en.wikibooks.org/wiki/GLPK/Python

Finally, you would simply sample histogram i with weight w_i. This can be done by adapting roulette-wheel selection using cumsum and searchsorted, as suggested by @Ilmari Karonen; see this answer.
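As a rough sketch of that last step, assuming weights is the vector returned by the LP above: the cumsum/searchsorted pair implements the roulette wheel, and dropping duplicate picks is one crude way to make the draw without replacement, which is why the result can have slightly fewer rows than requested (the question allows this):

import numpy as np

def sample(data, n_samples, weights):
    # Roulette-wheel selection: pick rows in proportion to their weights.
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]                              # normalise so cdf[-1] == 1
    picks = np.searchsorted(cdf, np.random.random(n_samples))
    idx = np.unique(picks)                      # drop repeats -> no replacement
    return data[idx]

# e.g.  samples = sample(data, 500, uniform_weights(data))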

If you want the resulting weighted distribution to be "as uniform as possible", I would solve a similar weighting problem, but maximize the entropy of the weighted bin sums instead. This problem turns out to be non-linear, although you could use any number of non-linear solvers, such as BFGS or other gradient-based methods. This will probably be a bit slower than the LP method, but it depends on what you need in your application. The LP method will come very close to the non-linear method if you have a large number of histograms, because it would then be easy to achieve a nearly even distribution.
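For illustration, a sketch of that non-linear variant with scipy.optimize.minimize and SLSQP (any solver that handles the simplex constraints would do; the small constant inside the log is only there to keep it finite, and this will be slow when the number of histograms is large):

import numpy as np
from scipy.optimize import minimize

def entropy_weights(data):
    D = data.astype(float)
    n = D.shape[0]

    def neg_entropy(w):
        p = D.T @ w                             # weighted bin sums
        p = p / p.sum()                         # normalise to a distribution
        return np.sum(p * np.log(p + 1e-12))    # equals -H(p)

    w0 = np.full(n, 1.0 / n)                    # start from uniform weights
    res = minimize(neg_entropy, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x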



When using the LP solution, a bunch of the histogram weights may end up at 0, because the number of constraints is small; this stops being an issue once you have a non-trivial number of bins, since the number of constraints is O(n^2) in the number of bins.

Example weights with 50 histograms and 10 bins:

[0.006123642775837011, 0.08591660144140816, 0.0, 0.0, 0.0, 0.0, 0.03407525280610657, 0.0, 0.0, 0.0, 0.07092537493489116, 0.0, 0.0, 0.023926802333318554, 0.0, 0.03941537854267549, 0.0, 0.0, 0.0, 0.0, 0.10937063438351756, 0.08715770469631079, 0.0, 0.05841899435928017, 0.016328676622408153, 0.002218517959171183, 0.0, 0.0, 0.0, 0.08186919626269101, 0.03173286609277701, 0.08737065271898292, 0.0, 0.0, 0.041505225727435785, 0.05033635148761689, 0.0, 0.09172214842175723, 0.027548495513552738, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0259929997624099, 0.0, 0.0, 0.028044483157851748, 0.0, 0.0, 0.0]


With 50 histograms and 50 bins, there are now very few zero values:

[0.0219136051655165, 0.0, 0.028325808078797768, 0.0, 0.040889043180965624, 0.04372501089775975, 0.0, 0.031032870504105477, 0.020745831040881676, 0.04794861828714149, 0.0, 0.03763592540998652, 0.0029093177405377577, 0.0034239051136138398, 0.0, 0.03079554151573207, 0.0, 0.04676278554085836, 0.0461258666541918, 9.639105313353352e-05, 0.0, 0.013649362063473166, 0.059168272186891635, 0.06703936360466661, 0.0, 0.0, 0.03175895249795131, 0.0, 0.0, 0.04376133487616099, 0.02406633433758186, 0.009724226721798858, 0.05058252335384487, 0.0, 0.0393763638188805, 0.05287112817101315, 0.0, 0.0, 0.06365320629437914, 0.0, 0.024978299494456246, 0.023531082497830605, 0.033406648550332804, 0.012693750980220679, 0.00274892002684083, 0.0, 0.0, 0.0, 0.0, 0.04465971034045478, 4.888224154453002]




Could you draw a number of complete random samples (of 500) and then pick the most uniform one (i.e. the one with the lowest sample.sum(axis=0).std())? That avoids the strange biases that can come from drawing the selection incrementally.
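A quick sketch of that idea (the number of candidate draws, n_tries, is an arbitrary choice here; more tries give a more uniform winner at the cost of runtime):

import numpy as np

def best_of_random_samples(data, n_samples=500, n_tries=100):
    # Draw several complete samples without replacement and keep the one
    # whose summed histogram has the smallest standard deviation.
    best, best_std = None, np.inf
    for _ in range(n_tries):
        idx = np.random.choice(len(data), size=n_samples, replace=False)
        candidate = data[idx]
        spread = candidate.sum(axis=0).std()
        if spread < best_std:
            best, best_std = candidate, spread
    return best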









