Pandas - Groupby with conditional formula
   Survived  SibSp  Parch
0         0      1      0
1         1      1      0
2         1      0      0
3         1      1      0
4         0      0      1
Given the dataframe above, is there an elegant way to group by a condition? I want to split the data into two groups based on the following conditions:
(df['SibSp'] > 0) | (df['Parch'] > 0) = New Group - "Has Family"
(df['SibSp'] == 0) & (df['Parch'] == 0) = New Group - "No Family"
then take the mean of each of these groups and get a result similar to the following:
SurvivedMean
Has Family Mean
No Family Mean
Can this be done with groupby or do I need to add a new column using the above conditional?
An easy way to group is to use the sum of these two columns. Assuming neither column contains negative values, the sum is at least 1 whenever either of them is positive. groupby also accepts an arbitrary array as the key, as long as its length matches the length of the DataFrame, so you don't need to add a new column.
import numpy as np
family = np.where((df['SibSp'] + df['Parch']) >= 1, 'Has Family', 'No Family')
df.groupby(family)['Survived'].mean()
Out:
Has Family 0.5
No Family 1.0
Name: Survived, dtype: float64
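For reference, here is a minimal self-contained sketch of the same idea. It rebuilds the five sample rows from the question; the name survival_by_family is only illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "SibSp":    [1, 1, 0, 1, 0],
    "Parch":    [0, 0, 0, 0, 1],
})

# Build the grouping key as a plain array; no new column is added to df.
# This assumes SibSp and Parch are never negative.
family = np.where((df["SibSp"] + df["Parch"]) >= 1, "Has Family", "No Family")

# groupby accepts the array directly because it has one entry per row.
survival_by_family = df.groupby(family)["Survived"].mean()
print(survival_by_family)
```

The key array only has to line up with the rows of df, so the original frame is left unchanged.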
Use only one condition if the SibSp and Parch columns never contain values less than 0:
m1 = (df['SibSp'] > 0) | (df['Parch'] > 0)
result = df.groupby(np.where(m1, 'Has Family', 'No Family'))['Survived'].mean()
print(result)
Has Family 0.5
No Family 1.0
Name: Survived, dtype: float64
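If you want the result shaped like the table in the question, with a column called SurvivedMean, one possible sketch is to rename the resulting Series and turn it into a one-column DataFrame. The sample data is copied from the question; the names labels and summary are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "SibSp":    [1, 1, 0, 1, 0],
    "Parch":    [0, 0, 0, 0, 1],
})

# Single condition; valid as long as neither column holds negative values.
has_family = (df["SibSp"] > 0) | (df["Parch"] > 0)
labels = np.where(has_family, "Has Family", "No Family")

# Mean per group, renamed and converted to a one-column DataFrame.
summary = (
    df.groupby(labels)["Survived"]
      .mean()
      .rename("SurvivedMean")
      .to_frame()
)
print(summary)
```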
If negative values can occur, use both conditions, with a fallback label for rows that match neither:
m1 = (df['SibSp'] > 0) | (df['Parch'] > 0)
m2 = (df['SibSp'] == 0) & (df['Parch'] == 0)
a = np.where(m1, 'Has Family',
             np.where(m2, 'No Family', 'Not'))
result = df.groupby(a)['Survived'].mean()
print(result)
Has Family 0.5
No Family 1.0
Name: Survived, dtype: float64
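The nested np.where can also be written with np.select, which takes a list of conditions and a list of labels. This is a swapped-in alternative rather than the code from the answer, and the three rows below (including the negative SibSp value) are hypothetical data chosen only to make the fallback label visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has a negative SibSp, so it matches neither condition.
df = pd.DataFrame({
    "Survived": [1, 0, 1],
    "SibSp":    [1, 0, -1],
    "Parch":    [0, 0, 0],
})

m1 = (df["SibSp"] > 0) | (df["Parch"] > 0)
m2 = (df["SibSp"] == 0) & (df["Parch"] == 0)

# The first matching condition wins; rows matching none get the default label.
labels = np.select([m1, m2], ["Has Family", "No Family"], default="Not")

means = df.groupby(labels)["Survived"].mean()
print(means)
```

With more than two or three labels, the flat lists tend to be easier to read than deeply nested np.where calls.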
Without pandas, you can define your conditions in a list and use the function group_by_condition below to build a filtered list for each condition. Note that df here is a plain list of dicts, not a DataFrame. You can then unpack the resulting lists into separate names:
df = [
    {"Survived": 0, "SibSp": 1, "Parch": 0},
    {"Survived": 1, "SibSp": 1, "Parch": 0},
    {"Survived": 1, "SibSp": 0, "Parch": 0},
]
conditions = [
    lambda x: (x['SibSp'] > 0) or (x['Parch'] > 0),     # has family
    lambda x: (x['SibSp'] == 0) and (x['Parch'] == 0),  # no family
]
def group_by_condition(l, conditions):
    return [[item for item in l if condition(item)] for condition in conditions]
[has_family, no_family] = group_by_condition(df, conditions)
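The snippet above stops at splitting the rows. To get the per-group mean the question asks for, one option is to average the Survived values of each list, sketched here with statistics.fmean from the standard library. The names rows and means are assumptions made for this illustration.

```python
from statistics import fmean

# Same three records as in the answer, held in a plain list of dicts.
rows = [
    {"Survived": 0, "SibSp": 1, "Parch": 0},
    {"Survived": 1, "SibSp": 1, "Parch": 0},
    {"Survived": 1, "SibSp": 0, "Parch": 0},
]

conditions = [
    lambda x: (x["SibSp"] > 0) or (x["Parch"] > 0),     # has family
    lambda x: (x["SibSp"] == 0) and (x["Parch"] == 0),  # no family
]

def group_by_condition(items, conditions):
    """Return one filtered list per condition, in the same order."""
    return [[item for item in items if condition(item)] for condition in conditions]

has_family, no_family = group_by_condition(rows, conditions)

# fmean raises StatisticsError on an empty group, so each group must be non-empty.
means = {
    "Has Family": fmean(r["Survived"] for r in has_family),
    "No Family": fmean(r["Survived"] for r in no_family),
}
print(means)
```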