Best way to get joint probability from 2D numpy
I wonder if there is a better way to compute joint probabilities from a two-dimensional numpy array, perhaps using some built-in numpy functions.
For simplicity, let's say we have an example of an array:
[['apple','pie'], ['apple','juice'], ['orange','pie'], ['strawberry','cream'], ['strawberry','candy']]
I would like to get probabilities like these:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Here "juice" as the second word has a probability of 0.2, since "apple" occurs with probability 2/5 and "juice" follows "apple" with probability 1/2.
On the other hand, "pie" as the second word has a probability of 0.4: the sum of the contributions from "apple" and "orange".
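To make the arithmetic explicit, here is a minimal plain-Python sketch of this definition (my own illustration, using collections.Counter; variable names are mine):

```python
from collections import Counter

pairs = [('apple', 'pie'), ('apple', 'juice'), ('orange', 'pie'),
         ('strawberry', 'cream'), ('strawberry', 'candy')]
n = len(pairs)

first_counts = Counter(first for first, _ in pairs)  # apple: 2, orange: 1, strawberry: 2
pair_counts = Counter(pairs)                         # every pair occurs once in this data

joint = {}
for (first, second), c in pair_counts.items():
    p_first = first_counts[first] / n                    # e.g. P(apple) = 2/5
    p_second_given_first = c / first_counts[first]       # e.g. P(juice | apple) = 1/2
    joint[(first, second)] = p_first * p_second_given_first

# marginal probability of each second word
second_marginal = Counter()
for (first, second), p in joint.items():
    second_marginal[second] += p

print(joint[('apple', 'juice')])   # 0.2
print(second_marginal['pie'])      # 0.4
```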
My approach adds 3 new columns to the array: the probability of the 1st column, the probability of the 2nd column, and the final probability. I group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.
Below is my code:
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True) ## TRANSPOSE a, AND GET unique
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)] ## ADD 3 COLUMNS to a
groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:,0] == _unique ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3]) ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
I wonder if there is a shorter or faster way to do this with numpy or other means. Adding the columns is not required; that was just my approach, and a different one would be acceptable.
Based on the definition of the probability distribution you provided, you can use pandas to accomplish the same, i.e.
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Find the frequency of the first word and divide by the total number of rows
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]
# Divide 1 by the number of repetitions of the first word
df[3] = 1 / df[0].map(df[0].value_counts())
# Multiply the probabilities
df[4] = df[2] * df[3]
Output:
            0      1    2    3    4
0       apple    pie  0.4  0.5  0.2
1       apple  juice  0.4  0.5  0.2
2      orange    pie  0.2  1.0  0.2
3  strawberry  cream  0.4  0.5  0.2
4  strawberry  candy  0.4  0.5  0.2
If you want a list view you can use df.values.tolist()
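For instance, rebuilding the five-column frame from above and taking the list view (a sketch of my own; the integer column labels are pandas defaults):

```python
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],
              ['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]   # P(first word)
df[3] = 1 / df[0].map(df[0].value_counts())             # P(second | first)
df[4] = df[2] * df[3]                                   # joint probability

rows = df.values.tolist()
print(rows[0])  # ['apple', 'pie', 0.4, 0.5, 0.2]
```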
If you don't need columns then
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
            0      1    2
0       apple    pie  0.2
1       apple  juice  0.2
2      orange    pie  0.2
3  strawberry  cream  0.2
4  strawberry  candy  0.2
For the marginal probability of each second word, print(df.groupby(1)[2].sum()):
candy    0.2
cream    0.2
juice    0.2
pie      0.4
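For completeness, the same computation can also be done in numpy alone. This is a sketch of my own (not from the answer above), using np.unique with return_inverse to broadcast each first word's count back to its rows; the conditional factor is 1/count only because every (first, second) pair occurs once in this data:

```python
import numpy as np

a = np.array([['apple', 'pie'], ['apple', 'juice'], ['orange', 'pie'],
              ['strawberry', 'cream'], ['strawberry', 'candy']])

# counts[inv] gives, for each row, how often its first word appears
_, inv, counts = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = counts[inv] / len(a)   # P(first word) per row
p_cond = 1 / counts[inv]         # P(second | first): each pair occurs once here
p_joint = p_first * p_cond       # 0.2 for every row of this example

print(p_joint)
```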