Pandas - Replace Outliers with Group Value

I have a pandas DataFrame that I would like to split into groups, calculate the mean and standard deviation of each group, and then replace all outliers with the group mean. A value counts as an outlier if it is more than three standard deviations from its group mean.

df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})

I thought the following would work:

df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))

but I get the following error:

NameError: name 'x' is not defined

I also tried to define the transform function separately:

def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean

and then called it like this:

df.groupby('a')['b'].transform(lambda x: trans_func(x))

but I got another error:

KeyError: 0

Finally, I resorted to creating a separate column:

df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']] 

but that didn't work either:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any advice is greatly appreciated.

3 answers


Try the following:

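def replace(group):
    # Work on a copy so the original group data is not modified in place
    group = group.copy()
    mean, std = group.mean(), group.std()
    # Flag values more than 3 standard deviations from the group mean
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group

df.groupby('a').transform(replace)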

Note: if you want to eliminate the 100 in your last group, you can replace 3*std with 1*std. The standard deviation in that group is 48.33, so with the 3*std threshold the 100 is not treated as an outlier and stays in the result.
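
The figures in this note can be checked directly. The sketch below rebuilds group B from the question and compares the 3*std and 1*std thresholds; the variable names are illustrative only.

```python
import pandas as pd

# Group B from the question's example frame.
b = pd.Series([3.3, 3.4, 3.3, 100.0])

mean, std = b.mean(), b.std()        # std() uses the sample formula (ddof=1)
deviation = (b - mean).abs()

flagged_3std = deviation > 3 * std   # threshold used in the answer's code
flagged_1std = deviation > 1 * std   # threshold suggested in the note

print(round(mean, 2), round(std, 2))
print(flagged_3std.tolist())
print(flagged_1std.tolist())
```

Here the 100.0 sits 72.5 away from the group mean of 27.5, which is 1.5 standard deviations. In a group of four values no single point can lie further than 1.5 sample standard deviations from the mean, so a 3*std rule cannot flag anything in groups this small.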

It would be more appropriate to remove the outliers first and then calculate the group means used for replacement. If the replacement mean is calculated with the outliers included, the mean itself is skewed by them.
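
One way to sketch this idea: flag the outliers using the full-group statistics, then take the replacement mean from the remaining values only. The function name `replace_with_clean_mean` and its `n_std` parameter are made up for illustration, and `n_std=1` is used here only because a 3*std rule flags nothing in the question's small groups.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

def replace_with_clean_mean(group, n_std=3):
    """Hypothetical helper: replace outliers with the mean of the other values."""
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > n_std * std
    # Mean computed without the flagged values, so they cannot skew it
    clean_mean = group[~outliers].mean()
    # mask() returns a new Series and leaves the input untouched
    return group.mask(outliers, clean_mean)

# Extra keyword arguments to transform() are passed on to the function
df['b_clean'] = df.groupby('a')['b'].transform(replace_with_clean_mean, n_std=1)
print(df)
```

With `n_std=1`, the 100.0 in group B is replaced by the mean of 3.3, 3.4 and 3.3 (about 3.33) rather than by 27.5. Note that the same threshold also flags the 1.2 in group A, which shows how sensitive the rule is on very small groups.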




Hope this is helpful:

Step 1, remove the outliers (from a related question on removing outliers with a pandas group by):

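def is_outlier(s):
    # True where a value lies outside mean +/- 3 standard deviations
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)

# 'b' is the value column in the question's frame (the linked code used 'count').
# group_keys=False keeps the original row index so the mask aligns with df.
df = df[~df.groupby('a', group_keys=False)['b'].apply(is_outlier)]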

Step 2, replace the outliers (code from elyase's answer):

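def replace(group):
    # Work on a copy so the original group data is not modified in place
    group = group.copy()
    mean, std = group.mean(), group.std()
    # Flag values more than 3 standard deviations from the group mean
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group

df.groupby('a').transform(replace)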
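
The same replacement can also be sketched without a custom function, which is close to what the question's `df['c']` attempt was aiming for. This version assumes the question's frame and column names. `transform('mean')` and `transform('std')` broadcast each group's statistic back onto the original rows, so the comparison is element-wise and no `if` on a whole Series is needed.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

grouped = df.groupby('a')['b']
group_mean = grouped.transform('mean')   # one group mean per original row
group_std = grouped.transform('std')     # one group std per original row

# Element-wise test: True where a value is an outlier within its group
is_outlier = (df['b'] - group_mean).abs() > 3 * group_std

# where() keeps values where the condition holds, otherwise takes the replacement
df['c'] = df['b'].where(~is_outlier, group_mean)
print(df)
```

With the question's data nothing exceeds 3*std, so column `c` ends up equal to `b`; lowering the multiplier changes which rows are replaced.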
