Python - KL divergence on numpy arrays with different lengths

I am using SciPy's KL divergence implementation (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html) for two different numpy arrays.

The first, say "base_freq", has a fixed length of 2000. The length of the second, "test_freq", can vary depending on the sample. So let's say its length is 8000.

How can I calculate the KL divergence when the two arrays are not the same length?

My thought was to split the second array ("test_freq") into multiple arrays of length 2000. But how is this done? And what happens when "test_freq" receives a sample of length 250?



2 answers


Disclaimer: I am not a statistician.

KL divergence is a measure of how one probability distribution differs from another. This means you must make sure that the inputs to your entropy function are two valid probability distributions defined over the same sample space.

In your case, you have a finite number of possible values, so you have a discrete random variable. This also means that the probability of each outcome of your variable can be estimated as its frequency of occurrence over a series of trials.

Let me give you a simple example. Say your random variable represents a biased die with 6 possible outcomes (6 sides). You roll the die 100 times.

Imagine you got the following distribution of outcomes:

1: 10 times
2: 12 times
3: 08 times
4: 30 times
5: 20 times
6: 20 times

      

Since you know how many times each outcome (side) occurred, you just divide each count by 100. This gives you the frequency, which is also the probability.



So now we have:

P(side=1) = 10/100 = .10
P(side=2) = 12/100 = .12
P(side=3) = 08/100 = .08
P(side=4) = 30/100 = .30
P(side=5) = 20/100 = .20
P(side=6) = 20/100 = .20

      

And finally, here's your probability distribution:

[.10, .12, .08, .30, .20, .20]

      

Note that the probabilities sum to 1, as expected for a probability distribution.
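As a minimal sketch of that calculation in numpy (using the counts from the example above):

import numpy as np

# Counts from the example above: how often each of the 6 sides came up in 100 rolls.
counts = np.array([10, 12, 8, 30, 20, 20])

# Normalize the counts to get the empirical probability distribution.
probs = counts / counts.sum()

print(probs)        # [0.1  0.12 0.08 0.3  0.2  0.2 ]
print(probs.sum())  # 1.0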

If you run a second experiment and come up with a different probability distribution, it will still have 6 probabilities, even if your sample count is not 100 this time.

All of which is to say that there is no point in comparing two probability distributions from different sample spaces. If you have a way to convert from one sample space to the other, that would be possible, but make sure that both probability distributions are defined over the same sample space. It doesn't make sense to compare the probabilities of a 6-sided die with those of an 8-sided die, because they don't represent the same thing.
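To connect this back to the question: once both arrays describe the same set of outcomes, they have the same length by construction, and scipy.stats.entropy can compare them directly. Here is a sketch where the second set of counts is invented purely for illustration:

import numpy as np
from scipy.stats import entropy

# First experiment: 100 rolls (counts from the example above).
counts_a = np.array([10, 12, 8, 30, 20, 20])

# Second experiment: 250 rolls (hypothetical counts, for illustration only).
counts_b = np.array([40, 35, 30, 60, 45, 40])

# Both distributions live on the same 6-outcome sample space,
# so they have the same length regardless of how many rolls were made.
p = counts_a / counts_a.sum()
q = counts_b / counts_b.sum()

# entropy(p, q) with two arguments returns the KL divergence D(p || q).
print(entropy(p, q))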



I need to preface this by saying that I am not an information theory specialist. For one application in which I used KL divergence, I compared two images pixel by pixel to calculate the number of bits lost. If the images are of different sizes, your suggested approach would require that for each pixel in the smaller image I select the corresponding pixel in the larger one, not just any pixel. My understanding is that the KL divergence only makes sense if you are comparing two signals sampled identically (i.e., with the same time or space sampling interval).

If you want to do what you suggest, you can use numpy.random.choice:



import numpy as np

def uneven_kl_divergence(pk, qk):
    # Randomly subsample the longer array (with replacement) so that
    # both arrays have the same length.
    if len(pk) > len(qk):
        pk = np.random.choice(pk, len(qk))
    elif len(qk) > len(pk):
        qk = np.random.choice(qk, len(pk))
    # KL divergence D(pk || qk), assuming both inputs hold valid
    # probability values.

      
