Python Pandas: compare values ββof two dataframes with different column names by id or row
I have two dataframes (call them M and K) that come from different sources. They have different column names, and one column that is the same in both data frames is the ID (M [id] == K [id]) column.
The number of lines in both data frames is equal; the number of columns is different.
The goal is to create a matrix that says how many columns have the same values ββfor the same ID (or row). Matrix size (MK) is M. columns X K. columns. Each cell is a count of the number of matching values ββfor the M.column and K. column pair. The maximum number in a cell is the number of rows for M or K, since they are the same. Missing values ββ(NaN) should be ignored.
Let the numbers speak =)
data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
'm1': ['a', 'b', 'c', 'd', 'e', 2],
'm2': [1, 2, 3, 4, np.nan, 1],
'm3': ['aa','b','cc','d','ff', 3],
'm4': [4, 6, 3, 4, np.nan, 2],
'm5': ['b', 6, 'a', 4, np.nan, 1],
}
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
'k1': ['z', 'bb', 'c', 'd', 'e', 4],
'k2': [1, 2, 32, 5, np.nan, 1],
'k3': ['aa','b','cc','d','ff', 1],
'k4': [4, 2, 2, 4, np.nan, 4],
'k5': [4, 1, 'as', 4, np.nan, 2],
'k6': ['aa', 1, 'a', 3, np.nan, 2],
}
M = pd.DataFrame(data_M, columns = ['id','m1','m2','m3','m4','m5'])
K = pd.DataFrame(data_K, columns = ['id','k1','k2','k3','k4', 'k5','k6'])
Output M and K
M
Out[2]:
id m1 m2 m3 m4 m5
0 id1 a 1.0 aa 4.0 b
1 id2 b 2.0 b 6.0 6
2 id3 c 3.0 cc 3.0 a
3 id4 d 4.0 d 4.0 4
4 id5 e NaN ff NaN NaN
5 id6 2 1.0 3 2.0 1
K
Out[3]:
id k1 k2 k3 k4 k5 k6
0 id1 z 1.0 aa 4.0 4 aa
1 id2 bb 2.0 b 2.0 1 1
2 id3 c 32.0 cc 2.0 as a
3 id4 d 5.0 d 4.0 4 3
4 id5 e NaN ff NaN NaN NaN
5 id6 4 1.0 1 4.0 2 2
In the first comparison for id == 'id1', the MK matrix should look something like this:
id m1 m2 m3 m4 m5
id 1 0 0 0 0 0
k1 0 0 0 0 0 0
k2 0 0 1 0 0 0
k3 0 0 0 1 0 0
k4 0 0 0 0 1 0
k5 0 0 0 0 1 0
k6 0 0 0 1 0 0
On the second one (id == 'id2') it should be like this:
id m1 m2 m3 m4 m5
id 2 0 0 0 0 0
k1 0 0 0 0 0 0
k2 0 0 2 0 0 0
k3 0 0 0 2 0 0
k4 0 0 1 0 1 0
k5 0 0 0 0 1 0
k6 0 0 0 1 0 0
At the very end, each cell will be converted to a percentage of the matched values.
And last. In theory, this could be more than one line for each identifier. However, this does not apply to the current problem. But if you have inspiration, you can solve the "general case" ^ _ ^
Many thanks.
source to share
Broadcast approach numpy
andpd.Panel
m = M.values[:, 1:] k = K.values[:, 1:] p = pd.Panel( (m[:, None] == k[:, :, None]).astype(np.uint8), M.id.values, K.columns[1:], M.columns[1:])
then access for each id
p['id1']
m1 m2 m3 m4 m5
k1 0 0 0 0 0
k2 0 1 0 0 0
k3 0 0 1 0 0
k4 0 0 0 1 0
k5 0 0 0 1 0
k6 0 0 1 0 0
Or using pandas
groupby
df = M.set_index('id').join(K.set_index('id'))
def row_comp(r):
m = r.filter(like='m')
k = r.filter(like='k')
return pd.DataFrame(
(m.values == k.values.T).astype(np.uint8),
k.columns, m.columns
)
df.groupby(level=0).apply(row_comp)
m1 m2 m3 m4 m5
id
id1 k1 0 0 0 0 0
k2 0 1 0 0 0
k3 0 0 1 0 0
k4 0 0 0 1 0
k5 0 0 0 1 0
k6 0 0 1 0 0
id2 k1 0 0 0 0 0
k2 0 1 0 0 0
k3 1 0 1 0 0
k4 0 1 0 0 0
k5 0 0 0 0 0
k6 0 0 0 0 0
id3 k1 1 0 0 0 0
k2 0 0 0 0 0
k3 0 0 1 0 0
k4 0 0 0 0 0
k5 0 0 0 0 0
k6 0 0 0 0 1
id4 k1 1 0 1 0 0
k2 0 0 0 0 0
k3 1 0 1 0 0
k4 0 1 0 1 1
k5 0 1 0 1 1
k6 0 0 0 0 0
id5 k1 1 0 0 0 0
k2 0 0 0 0 0
k3 0 0 1 0 0
k4 0 0 0 0 0
k5 0 0 0 0 1
k6 0 0 0 0 1
id6 k1 0 0 0 0 0
k2 0 1 0 0 1
k3 0 1 0 0 1
k4 0 0 0 0 0
k5 1 0 0 1 0
k6 1 0 0 1 0
source to share