Pandas delete rows if value is the same for a specific date range
I tried to find something similar but couldn't. I have the data structure below, and I want to delete rows where the score stays the same for 5 days or more. So in the example below, the PersonID AB-123 entries from 2/1 to 2/6 should be deleted, and likewise the DG-3465 entries from 2/3 to 2/10, but nothing for TY-9456. I thought of using rolling(), but that would only remove 2/1-2/5 for AB-123, not 2/6.
PersonID Date Score
AB-123 2/1/2016 0
AB-123 2/2/2016 0
AB-123 2/3/2016 0
AB-123 2/4/2016 0
AB-123 2/5/2016 0
AB-123 2/6/2016 0
AB-123 2/7/2016 67.5
AB-123 2/8/2016 73.4
AB-123 2/9/2016 70.5
AB-123 2/10/2016 68
DG-3465 2/1/2016 22.5
DG-3465 2/2/2016 25.6
DG-3465 2/3/2016 36.4
DG-3465 2/4/2016 36.4
DG-3465 2/5/2016 36.4
DG-3465 2/6/2016 36.4
DG-3465 2/7/2016 36.4
DG-3465 2/8/2016 36.4
DG-3465 2/9/2016 36.4
DG-3465 2/10/2016 36.4
TY-9456 2/1/2016 0
TY-9456 2/2/2016 0
TY-9456 2/3/2016 5.23
TY-9456 2/4/2016 4.12
TY-9456 2/5/2016 5.95
TY-9456 2/6/2016 6.97
TY-9456 2/7/2016 12.45
TY-9456 2/8/2016 15.61
TY-9456 2/9/2016 15.61
TY-9456 2/10/2016 15.61
I tried a few different things but got stuck. What do you suggest? I'm using Python pandas, by the way ;)
You can apply a rolling window to the Score column, calculate the standard deviation, and then drop the rows where the standard deviation is zero along with the four rows before each of them (this assumes you want to remove rows with the same values on consecutive days):
df.drop(np.unique(df.Score.rolling(5).std()[lambda x: x == 0].index.values - np.arange(5)[:, None]))
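The window-then-expand idea above can be sketched roughly as follows. This is not the answer's exact code: it compares the rolling max with the rolling min instead of testing `std() == 0` (to avoid depending on floating-point behaviour), and it rolls within each PersonID so a window never spans two people. Both changes are assumptions on my part, as are the shortened sample data and the names `window`, `ends`, and `to_drop`.

```python
import numpy as np
import pandas as pd

# Shortened sample data (hypothetical): person A has six equal scores in a row.
df = pd.DataFrame({
    "PersonID": ["A"] * 8 + ["B"] * 4,
    "Score": [0, 0, 0, 0, 0, 0, 67.5, 73.4, 1.0, 1.0, 1.0, 2.0],
})

window = 5
scores = df.groupby("PersonID")["Score"]

# Rolling max and min per person; they are equal only when all values
# in the window are identical. Incomplete windows give NaN, and
# NaN == NaN is False, so they are never flagged.
roll_max = scores.transform(lambda s: s.rolling(window).max())
roll_min = scores.transform(lambda s: s.rolling(window).min())

# Index labels where a constant window ENDS.
ends = df.index[roll_max == roll_min].to_numpy()

# Expand each end label back over the whole window by broadcasting:
# shape (window, 1) against shape (n,) gives shape (window, n).
# This relies on the default RangeIndex, where labels are consecutive integers.
to_drop = np.unique(ends - np.arange(window)[:, None])

result = df.drop(to_drop)
print(result)
```

Because every flagged window end is expanded backwards, overlapping windows cover a run longer than five rows in full, which is why the sixth row of the run is dropped as well.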
You can group using shift() and cumsum(). Edited to include @Scott Boston's suggestion:
df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5)
PersonID Date Score
6 AB-123 2/7/2016 67.50
7 AB-123 2/8/2016 73.40
8 AB-123 2/9/2016 70.50
9 AB-123 2/10/2016 68.00
10 DG-3465 2/1/2016 22.50
11 DG-3465 2/2/2016 25.60
20 TY-9456 2/1/2016 0.00
21 TY-9456 2/2/2016 0.00
22 TY-9456 2/3/2016 5.23
23 TY-9456 2/4/2016 4.12
24 TY-9456 2/5/2016 5.95
25 TY-9456 2/6/2016 6.97
26 TY-9456 2/7/2016 12.45
27 TY-9456 2/8/2016 15.61
28 TY-9456 2/9/2016 15.61
29 TY-9456 2/10/2016 15.61
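To make the shift/cumsum grouping easier to follow, here is a small sketch that builds the run identifier step by step. The shortened data and the names `run_id` and `run_len` are illustrative assumptions, and it uses `transform("size")` with a boolean mask rather than `filter`, which should select the same rows.

```python
import pandas as pd

# Shortened sample data (hypothetical), in the same layout as the question.
df = pd.DataFrame({
    "PersonID": ["AB-123"] * 7 + ["TY-9456"] * 4,
    "Score": [0, 0, 0, 0, 0, 0, 67.5, 0, 0, 5.23, 4.12],
})

# A new run starts wherever the score differs from the previous row;
# the cumulative sum of those "changed" flags numbers the runs.
run_id = (df["Score"] != df["Score"].shift()).cumsum()

# Length of the run each row belongs to, counted per person so that
# equal scores of two different people are never merged into one run.
run_len = df.groupby(["PersonID", run_id])["Score"].transform("size")

# Keep only rows whose run is shorter than 5.
result = df[run_len < 5]
print(result)
```

Here `run_id` comes out as 1 for the six zero rows of AB-123, so that run has length 6 and is dropped, while the two zero rows of TY-9456 form a run of length 2 and are kept.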
You can exclude the rows whose Score differs by 0 from the previous row and whose Date is exactly 1 day after it:
In [11]: df[(df.Score.diff() != 0) | (df.Date.diff() != pd.Timedelta(days=1))]
Out[11]:
PersonID Date Score
0 AB-123 2016-02-01 0.00
6 AB-123 2016-02-07 67.50
7 AB-123 2016-02-08 73.40
8 AB-123 2016-02-09 70.50
9 AB-123 2016-02-10 68.00
10 DG-3465 2016-02-01 22.50
11 DG-3465 2016-02-02 25.60
12 DG-3465 2016-02-03 36.40
20 TY-9456 2016-02-01 0.00
22 TY-9456 2016-02-03 5.23
23 TY-9456 2016-02-04 4.12
24 TY-9456 2016-02-05 5.95
25 TY-9456 2016-02-06 6.97
26 TY-9456 2016-02-07 12.45
27 TY-9456 2016-02-08 15.61