Pandas - Replace Outliers with Group Value

Question

I have a pandas DataFrame that I would like to split into groups, calculate each group's mean and standard deviation, and then replace all outliers with the group mean. A value counts as an outlier if it lies more than three standard deviations from its group mean.

df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})

I thought the following would work:

df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))

but get the following error:

NameError: name 'x' is not defined

I also tried to define the transform function separately:

def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean

and then call it like this:

df.groupby('a')['b'].transform(lambda x: trans_func(x))

but I am getting another error:

KeyError: 0

Finally, I resorted to creating a separate column:

df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']]

but that didn't work either:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any advice is greatly appreciated.

+3

python pandas

user3516758 Dec 24 at 14:58

3 answers

It would be more appropriate to exclude the outliers first and then calculate the group mean used for the replacement. If the replacement mean is calculated with the outliers included, it is itself skewed by them.
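
A minimal sketch of that idea, using the example frame from the question. The helper name `replace_with_clean_mean` and its `n_std` parameter are illustrative and not from the original post. Outliers are still detected against the full group mean and standard deviation; only the replacement value leaves them out. With the default threshold of 3 nothing is flagged in groups this small, so the call below passes a lower, arbitrary threshold purely for demonstration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

def replace_with_clean_mean(s, n_std=3):
    # Flag values further than n_std sample standard deviations from the group mean.
    outliers = (s - s.mean()).abs() > n_std * s.std()
    # Mean of the remaining values only, so the outliers do not distort it.
    clean_mean = s[~outliers].mean()
    # Keep non-outliers and substitute the clean mean elsewhere (no in-place mutation).
    return s.where(~outliers, clean_mean)

# n_std=1.4 is an arbitrary demonstration value, not a recommendation.
result = df.groupby('a')['b'].transform(replace_with_clean_mean, n_std=1.4)
print(result)
```

Here the 100.0 in group B is replaced by the mean of the other three values in that group rather than by a mean that includes 100.0 itself.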

+1

Andrius Vabalas Jan 6 at 13:40

Hope this is helpful:

Step 1, remove the outliers (see the linked question on removing outliers with pandas groupby):

def is_outlier(s):
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)

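# 'b' is the value column from the question; group_keys=False keeps the original index
df = df[~df.groupby('a', group_keys=False)['b'].apply(is_outlier)]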

Step 2, replace the outliers (taken from elyase's answer):

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group

df.groupby('a').transform(replace)

0

ahbon 01 Feb '19 at 7:24

elyase · Accepted Answer · 2014-12-24T15:10:32+0000

Try the following:

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group

df.groupby('a').transform(replace)

Note: If you want to remove 100 in your last group, you can replace 3*std with 1*std. The standard deviation in that group is 48.33, so with 3*std the value 100 is not flagged and stays in the result.
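
To see why the threshold matters here, the following sketch computes each value's absolute z-score within its group for the question's example frame. The variable names `grouped` and `z` are illustrative. It assumes the sample standard deviation (pandas' default, `ddof=1`), which is what `std()` uses in the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

grouped = df.groupby('a')['b']

# Distance of each value from its own group's mean, in units of the group's sample std.
z = (df['b'] - grouped.transform('mean')).abs() / grouped.transform('std')

print(df.assign(z=z))
print("flagged at 3*std:", (z > 3).tolist())
print("flagged at 1*std:", (z > 1).tolist())
```

In a group of n values no sample z-score can exceed (n-1)/sqrt(n), which is 1.5 for n = 4, so a 3*std rule cannot fire on groups this small. Lowering the threshold to 1*std does flag 100, but in this data it also flags the 1.2 in group A, so the threshold is worth checking against every group before relying on it.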
