How do I compare the aggregated parts of a pandas DataFrame?
Is it possible to compare parts of columns in a pandas DataFrame? I have the following example DataFrame: it stores four languages (en, de, nl, ua), and each language should have the same keys (the same number of keys) but different values. There is also a static column whose values always remain the same.
static  langs  keys   values
x       en     key_1  value_en_1
x       en     key_2  value_en_2
x       en     key_3  value_en_3
x       de     key_1  value_de_1
x       de     key_2  value_de_2
x       de     key_3  value_de_3
x       nl     key_1  value_nl_1
x       nl     key_2  value_nl_2
x       ua     key_1  value_ua_1
I need to check which keys, and how many, are missing in each language compared to English (en); something like this would be the desired result:
Lang  Static  # Missing  Keys
de    x       0
nl    x       1          key_3
ua    x       2          key_2, key_3
This is my current progress:
import pandas as pd

# this is read from a CSV, but I'll leave it as a list of lists for simplicity
rows = [
    ['x', 'en', 'key_1', 'value_en_1'],
    ['x', 'en', 'key_2', 'value_en_2'],
    ['x', 'en', 'key_3', 'value_en_3'],
    ['x', 'de', 'key_1', 'value_de_1'],
    ['x', 'de', 'key_2', 'value_de_2'],
    ['x', 'de', 'key_3', 'value_de_3'],
    ['x', 'nl', 'key_1', 'value_nl_1'],
    ['x', 'nl', 'key_2', 'value_nl_2'],
    ['x', 'ua', 'key_1', 'value_ua_1'],
]
# create a DataFrame out of the rows of data
df = pd.DataFrame(rows, columns=["static", "language", "keys", "values"])
# print out the DataFrame
print("DataFrame:", df)
# first group by the static column and language
df_grp = df.groupby(["static", "language"])
# count the number of keys and values per language
df_summ = df_grp.agg(["count"])
# print out the counts
print()
print(df_summ)
# how to compare?
# how to get the keys?
This is the output from df_summ:
                 keys values
                count  count
static language
x      de           3      3
       en           3      3
       nl           2      2
       ua           1      1
At this point, I don't know how to proceed. I am grateful for any help / advice.
PS This is in Python 3.5.
EDIT:
# build a set of keys per (static, language) group
a = df.groupby(['static', 'language'])['keys'].apply(set).reset_index()
# keep only the en rows and build the reference set per static group
b = df[df['language'] == 'en'].groupby('static')['keys'].apply(set)
# subtract each group's set from the mapped en set and join the remaining keys
c = (a['static'].map(b) - a['keys']).str.join(', ').rename('Keys')
# subtract the set lengths to get the number of missing keys
m = (a['static'].map(b).str.len() - a['keys'].str.len()).rename('Missing')
df = pd.concat([a[['static','language']], m, c], axis=1)
print (df)
  static language  Missing          Keys
0      x       de        0
1      x       en        0
2      x       nl        1         key_3
3      x       ua        2  key_3, key_2
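A different sketch of the same idea, not from the answers above: instead of sets, build the full grid of expected (static, language, key) combinations and left-merge the real rows against it, so a missing combination shows up as NaN. The names `en_keys`, `pairs`, `expected`, and `summary` are my own; the data is the question's sample.

```python
import pandas as pd

# the question's sample data, repeated so the sketch is self-contained
rows = [
    ['x', 'en', 'key_1', 'value_en_1'],
    ['x', 'en', 'key_2', 'value_en_2'],
    ['x', 'en', 'key_3', 'value_en_3'],
    ['x', 'de', 'key_1', 'value_de_1'],
    ['x', 'de', 'key_2', 'value_de_2'],
    ['x', 'de', 'key_3', 'value_de_3'],
    ['x', 'nl', 'key_1', 'value_nl_1'],
    ['x', 'nl', 'key_2', 'value_nl_2'],
    ['x', 'ua', 'key_1', 'value_ua_1'],
]
df = pd.DataFrame(rows, columns=['static', 'language', 'keys', 'values'])

# reference keys per static group, taken from the en rows
en_keys = df.loc[df['language'] == 'en', ['static', 'keys']]
# every (static, language) pair crossed with the reference keys
pairs = df[['static', 'language']].drop_duplicates()
expected = pairs.merge(en_keys, on='static')
# left-merge against the real rows; combinations that never occur get NaN
checked = expected.merge(df, on=['static', 'language', 'keys'], how='left')
missing = checked[checked['values'].isna()]
summary = (missing.groupby(['static', 'language'])['keys']
           .agg(Missing='size', Keys=lambda s: ', '.join(s)))
print(summary)
```

Note that languages with nothing missing (en, de) drop out of `summary` entirely, since they contribute no NaN rows; whether that is acceptable depends on how you want to report the result.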
EDIT:
I tried changing the data:
rows = [
    ['x', 'en', 'key_1', 'value_en_1'],
    ['x', 'en', 'key_2', 'value_en_2'],
    ['x', 'en', 'key_3', 'value_en_3'],
    ['x', 'de', 'key_1', 'value_de_1'],
    ['x', 'de', 'key_2', 'value_de_2'],
    ['x', 'de', 'key_3', 'value_de_3'],
    ['x', 'nl', 'key_1', 'value_nl_1'],
    ['x', 'nl', 'key_2', 'value_nl_2'],
    ['x', 'ua', 'key_1', 'value_en_1'],
    ['y', 'en', 'key_1', 'value_en_1'],
    ['y', 'en', 'key_2', 'value_en_2'],
    ['y', 'de', 'key_4', 'value_en_3'],
    ['y', 'de', 'key_1', 'value_de_1'],
    ['y', 'de', 'key_2', 'value_de_2'],
    ['y', 'de', 'key_3', 'value_de_3'],
    ['y', 'de', 'key_5', 'value_nl_1'],
    ['y', 'nl', 'key_2', 'value_nl_2'],
    ['y', 'ua', 'key_1', 'value_en_1'],
]
# create DataFrame out of rows of data
df = pd.DataFrame(rows, columns=["static", "language", "keys", "values"])
# print out DataFrame
#print(df)
and the output is:
print (df)
  static language  Missing          Keys
0      x       de        0
1      x       en        0
2      x       nl        1         key_3
3      x       ua        2  key_3, key_2
4      y       de       -3
5      y       en        0
6      y       nl        1         key_1
7      y       ua        1         key_2
The problem is de for the y static group: it has more keys than the en language, so the length subtraction goes negative.
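The negative count comes from subtracting the two set lengths directly, which also counts keys the language has but en lacks. One possible fix, sketched here as my own variant rather than taken from the answer above, is to take the one-sided set difference first and measure its length, so extra keys simply contribute nothing. The names `a`, `b`, `diff`, and `res` follow the answer's style; the data is the extended sample.

```python
import pandas as pd

# the extended sample data, including the y group with extra de keys
rows = [
    ['x', 'en', 'key_1', 'value_en_1'],
    ['x', 'en', 'key_2', 'value_en_2'],
    ['x', 'en', 'key_3', 'value_en_3'],
    ['x', 'de', 'key_1', 'value_de_1'],
    ['x', 'de', 'key_2', 'value_de_2'],
    ['x', 'de', 'key_3', 'value_de_3'],
    ['x', 'nl', 'key_1', 'value_nl_1'],
    ['x', 'nl', 'key_2', 'value_nl_2'],
    ['x', 'ua', 'key_1', 'value_ua_1'],
    ['y', 'en', 'key_1', 'value_en_1'],
    ['y', 'en', 'key_2', 'value_en_2'],
    ['y', 'de', 'key_4', 'value_en_3'],
    ['y', 'de', 'key_1', 'value_de_1'],
    ['y', 'de', 'key_2', 'value_de_2'],
    ['y', 'de', 'key_3', 'value_de_3'],
    ['y', 'de', 'key_5', 'value_nl_1'],
    ['y', 'nl', 'key_2', 'value_nl_2'],
    ['y', 'ua', 'key_1', 'value_en_1'],
]
df = pd.DataFrame(rows, columns=['static', 'language', 'keys', 'values'])

# set of keys per (static, language) group
a = df.groupby(['static', 'language'])['keys'].apply(set).reset_index()
# reference set per static group, taken from the en rows
b = df[df['language'] == 'en'].groupby('static')['keys'].apply(set)
# one-sided difference: what en has that this language lacks
diff = [sorted(ref - have) for ref, have in zip(a['static'].map(b), a['keys'])]
res = a[['static', 'language']].assign(
    Missing=[len(d) for d in diff],
    Keys=[', '.join(d) for d in diff],
)
print(res)
```

With this version, `y`/`de` reports 0 missing keys instead of -3; whether the extra keys should be reported separately (e.g. via the symmetric difference) is a separate decision.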
First, you can create the Missing column by grouping and counting the number of NaNs. Then create the Keys column and add the static column.
df2 = (
    df.groupby('language')['keys'].apply(lambda x: x.values)
      .apply(pd.Series)
      .assign(Missing=lambda x: x.isnull().sum(axis=1))
)
(
    df2[['Missing']].assign(
        static=df.static.iloc[0],
        Keys=df2.apply(lambda x: ','.join(df2.loc['en'].loc[x.isnull()]), axis=1),
    )
)
Out[44]:
          Missing         Keys static
language
de              0                   x
en              0                   x
nl              1        key_3      x
ua              2  key_2,key_3      x
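A related sketch, my own rather than part of the answer above: `pd.crosstab` builds the presence/absence matrix explicitly, which avoids relying on the keys being positionally aligned the way `apply(pd.Series)` does. It assumes the question's `df` (columns static, language, keys, values) with a single static group; the names `ct`, `mask`, and `out` are mine.

```python
import pandas as pd

# the question's sample data, repeated so the sketch is self-contained
rows = [
    ['x', 'en', 'key_1', 'value_en_1'],
    ['x', 'en', 'key_2', 'value_en_2'],
    ['x', 'en', 'key_3', 'value_en_3'],
    ['x', 'de', 'key_1', 'value_de_1'],
    ['x', 'de', 'key_2', 'value_de_2'],
    ['x', 'de', 'key_3', 'value_de_3'],
    ['x', 'nl', 'key_1', 'value_nl_1'],
    ['x', 'nl', 'key_2', 'value_nl_2'],
    ['x', 'ua', 'key_1', 'value_ua_1'],
]
df = pd.DataFrame(rows, columns=['static', 'language', 'keys', 'values'])

# one row per language, one column per key; 1 = present, 0 = absent
ct = pd.crosstab(df['language'], df['keys'])
# a key is missing when en has it but this language does not
mask = ct.eq(0) & ct.loc['en'].eq(1)
out = pd.DataFrame({
    'static': df['static'].iloc[0],
    '# Missing': mask.sum(axis=1),
    'Keys': mask.apply(lambda r: ', '.join(r.index[r]), axis=1),
})
print(out)
```

With more than one static group you would build one crosstab per group (or use a MultiIndex on static and language), so this stays a single-group sketch.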
# First we group by `language` and aggregate `static` with `min` (it's always the same anyway)
# and `keys` with a lambda function that creates a `set`.
In [2]: grouped = df.groupby('language').agg({'static': 'min', 'keys': lambda x: set(x)})
# Then we get the missing keys...
In [3]: missing = (grouped['keys']['en'] - grouped['keys'])
# ... and count them
In [4]: missing_counts = missing.apply(len).rename('# Missing')
# Then we join all of this together and replace the keys with a joined string.
In [5]: grouped.drop('keys', axis=1).join(missing_counts).join(missing.apply(', '.join)).reset_index()
Out[5]:
  language static  # Missing          keys
0       de      x          0
1       en      x          0
2       nl      x          1         key_3
3       ua      x          2  key_2, key_3
Since you put the R tag on your question, here is how to do it with tidyr and dplyr:
library(dplyr); library(tidyr)
df %>%
  complete(nesting(static, langs), keys) %>%
  group_by(langs) %>%
  summarise(Static = max(static),
            Missing = sum(is.na(values)),
            Keys = toString(keys[is.na(values)]))
  langs Static Missing         Keys
  <chr>  <chr>   <int>        <chr>
1    de      x       0
2    en      x       0
3    nl      x       1        key_3
4    ua      x       2 key_2, key_3
Data
df <- read.table(text="static langs keys values
'x' 'en' 'key_1' 'value_en_1'
'x' 'en' 'key_2' 'value_en_2'
'x' 'en' 'key_3' 'value_en_3'
'x' 'de' 'key_1' 'value_de_1'
'x' 'de' 'key_2' 'value_de_2'
'x' 'de' 'key_3' 'value_de_3'
'x' 'nl' 'key_1' 'value_nl_1'
'x' 'nl' 'key_2' 'value_nl_2'
'x' 'ua' 'key_1' 'value_ua_1'", header = TRUE, stringsAsFactors = FALSE)