Pandas - Replace Outliers with Group Value
I have a pandas DataFrame that I would like to split into groups, calculate each group's mean and standard deviation, and then replace all outliers with the group mean. A value counts as an outlier if it lies more than three standard deviations from its group mean.
df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})
I thought the following would work:
df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))
but get the following error:
NameError: name 'x' is not defined
I also tried to define the transform function separately:
def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean
and then call it like this:
df.groupby('a')['b'].transform(lambda x: trans_func(x))
but I am getting another error:
KeyError: 0
Finally, I resorted to creating a separate column:
df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']]
but that didn't work either:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any advice is greatly appreciated.
Try the following:
def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group
df.groupby('a').transform(replace)
Note: if you want the 100 in your last group to be replaced, you can change 3*std to 1*std. The standard deviation of that group is 48.33, so 100 lies within three standard deviations of the group mean (27.5) and is left unchanged by the 3*std threshold.
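The effect of the threshold described in the note can be checked directly on the sample data. The following is only a minimal sketch, not part of the original answer: the helper name replace_outliers and its n_std parameter are made up for illustration. It relies on groupby(...).transform('mean') and transform('std') to get per-row group statistics, and on Series.mask so that the comparison is applied element-wise rather than on the whole Series at once.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

def replace_outliers(values, keys, n_std=3.0):
    """Hypothetical helper: replace values further than n_std group
    standard deviations from their group mean with that group mean."""
    grouped = values.groupby(keys)
    mean = grouped.transform('mean')   # group mean, repeated for each row
    std = grouped.transform('std')     # sample std (ddof=1), repeated for each row
    outliers = (values - mean).abs() > n_std * std
    return values.mask(outliers, mean)  # new Series, input left untouched

print(df.groupby('a')['b'].agg(['mean', 'std']))

loose = replace_outliers(df['b'], df['a'], n_std=3)   # threshold from the question
strict = replace_outliers(df['b'], df['a'], n_std=1)  # threshold from the note
print(pd.DataFrame({'b': df['b'], 'n_std=3': loose, 'n_std=1': strict}))
```

With n_std=3 nothing is replaced. With n_std=1 the 100.0 becomes the group mean 27.5, but 1.2 in group A is flagged as well, because it lies slightly more than one standard deviation from that group's mean. A lower threshold therefore affects every group, not only the one with the obvious outlier.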
Hope this is helpful:
Step 1, remove the outliers (from the linked question on removing outliers with pandas group by):
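def is_outlier(s):
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)
# 'b' is the value column in the question's frame; transform keeps the
# original index, so the boolean mask lines up with df
df = df[~df.groupby('a')['b'].transform(is_outlier)]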
Step 2, replace the outliers (from elyase's answer):
def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group
df.groupby('a').transform(replace)
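The comment inside replace mentions group[~outliers].mean() as an alternative fill value, which keeps the outlier itself from skewing its replacement. Here is a hedged sketch of that variant. The function name replace_with_inlier_mean and the n_std parameter are illustrative and do not come from the answers. It returns a new Series through Series.mask instead of assigning into the group, and the call uses one standard deviation only so that the sample data contains something to replace.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

def replace_with_inlier_mean(group, n_std=3.0):
    """Hypothetical variant: fill outliers with the mean of the
    remaining (non-outlier) values of the same group."""
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > n_std * std
    inlier_mean = group[~outliers].mean()
    return group.mask(outliers, inlier_mean)  # does not modify `group`

# Assumption for the demo: a one-standard-deviation threshold
result = df.groupby('a')['b'].transform(
    lambda g: replace_with_inlier_mean(g, n_std=1))
print(pd.DataFrame({'a': df['a'], 'b': df['b'], 'replaced': result}))
```

On this data the 100.0 in group B is replaced by the mean of the three remaining values (about 3.33) rather than by 27.5, and the 1.2 in group A, which also exceeds the one-standard-deviation threshold, becomes 1.1. The original df is left unchanged.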