Best way to get joint probabilities from a 2D numpy array

I wonder if there is a better way to compute the joint probability of each row in a two-dimensional numpy array, perhaps using some built-in numpy functions.

For simplicity, let's say we have this example array:

[['apple','pie'],
 ['apple','juice'],
 ['orange','pie'],
 ['strawberry','cream'],
 ['strawberry','candy']]

I would like to get probabilities like this:

['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie']  --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2


Here "juice" as the second word has a probability of 0.2, since "apple" appears in 2/5 of the rows and "juice" follows "apple" half the time: 2/5 * 1/2 = 0.2.

On the other hand, "pie" as the second word has an overall probability of 0.4: the sum of the joint probabilities of the "apple" and "orange" rows (0.2 + 0.2).
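In other words, each row's joint probability is P(first) * P(second | first), and a word's overall probability as the second word is the sum of the joint probabilities of all pairs it ends. A quick check of the arithmetic in Python:

print(2/5 * 1/2)            # P(apple, juice) = 0.2
print(2/5 * 1/2 + 1/5 * 1)  # P(pie) = P(apple, pie) + P(orange, pie) = 0.4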

My approach was to append three new columns to the array: the probability of the 1st-column value, the conditional probability of the 2nd-column value, and the final joint probability. I then grouped the array by the 1st column, then by the 2nd column, and filled in the probabilities accordingly.

Below is my code:

import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])

ans = []
unique, counts = np.unique(a.T[0], return_counts=True)                      ## TRANSPOSE a, AND GET THE UNIQUE VALUES OF COLUMN 0
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]    ## ADD 3 COLUMNS TO a

groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique                                              ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows                                    ## P(COLUMN 0 VALUE)
    groups.append(curr_a)

## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)

    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows                                ## P(COLUMN 1 VALUE, GIVEN COLUMN 0 VALUE)
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])        ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])                                           ## KEEP THIS ROW'S RESULT

for an in ans:
    print(an)


Outputs:

['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']


I wonder if there is a shorter or faster way to do this with numpy or by other means. Adding the columns is not required; that was just my way of doing it, and another approach would be acceptable.

1 answer


Based on the definition of the probability distribution you provided, you can use pandas to accomplish the same, i.e.

import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])

df = pd.DataFrame(a)
# Frequency of the first word divided by the total number of rows: P(first)
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]
# 1 divided by the count of the first word: P(second | first), since each pair occurs once here
df[3] = 1 / df[0].map(df[0].value_counts())
# Multiply the two probabilities to get the joint probability
df[4] = df[2] * df[3]


Output:

            0      1    2    3    4
0       apple    pie  0.4  0.5  0.2
1       apple  juice  0.4  0.5  0.2
2      orange    pie  0.2  1.0  0.2
3  strawberry  cream  0.4  0.5  0.2
4  strawberry  candy  0.4  0.5  0.2

If you want it as a list, you can use df.values.tolist().
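For example, with the DataFrame above that gives:

print(df.values.tolist())
# [['apple', 'pie', 0.4, 0.5, 0.2],
#  ['apple', 'juice', 0.4, 0.5, 0.2],
#  ['orange', 'pie', 0.2, 1.0, 0.2],
#  ['strawberry', 'cream', 0.4, 0.5, 0.2],
#  ['strawberry', 'candy', 0.4, 0.5, 0.2]]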

If you don't need the intermediate columns, then:



df = pd.DataFrame(a)
# P(first) * P(second | first) computed in one step
df[2] = (df[0].map(df[0].value_counts()) / df.shape[0]) * (1 / df[0].map(df[0].value_counts()))


Output:

            0      1    2
0       apple    pie  0.2
1       apple  juice  0.2
2      orange    pie  0.2
3  strawberry  cream  0.2
4  strawberry  candy  0.2

For the overall probability of each second word (summing the joint probabilities over all first words), use print(df.groupby(1)[2].sum()):

candy    0.2
cream    0.2
juice    0.2
pie      0.4
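If you would rather stay in pure numpy, here is a minimal sketch of the same computation, assuming numpy 1.13+ (for the axis argument to np.unique); the variable names are just illustrative:

import numpy as np

a = np.array([['apple','pie'],['apple','juice'],
              ['orange','pie'],['strawberry','cream'],
              ['strawberry','candy']])

# Count the distinct first words; first_idx maps each row back to its first word
first_words, first_idx, first_counts = np.unique(
    a[:, 0], return_inverse=True, return_counts=True)
p_first = first_counts[first_idx] / len(a)                  # P(first) per row

# Count the distinct (first, second) pairs the same way
pairs, pair_idx, pair_counts = np.unique(
    a, axis=0, return_inverse=True, return_counts=True)
p_second = pair_counts[pair_idx] / first_counts[first_idx]  # P(second | first)

for row, p in zip(a, p_first * p_second):
    print(row, p)

This avoids the explicit grouping loops by letting np.unique do the counting and return_inverse broadcast the counts back to the rows.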