Pandas delete rows if value is the same for a specific date range

I tried to find something similar but could find it. So I have a data structure. I want to delete rows with the same score for 5 days or more. Therefore, in the lower case, it is necessary to delete the PeronID AB-123 entries from 2/1 to 2/6, and also for the DG-3465 from 2/3 to 2/10. But nothing for the TY-9456. I thought to use roll (), but that will only remove 2 / 1-2 / 5 for AB-123, but not 2/6.

PersonID    Date    Score
AB-123  2/1/2016    0
AB-123  2/2/2016    0
AB-123  2/3/2016    0
AB-123  2/4/2016    0
AB-123  2/5/2016    0
AB-123  2/6/2016    0
AB-123  2/7/2016    67.5
AB-123  2/8/2016    73.4
AB-123  2/9/2016    70.5
AB-123  2/10/2016   68
DG-3465 2/1/2016    22.5
DG-3465 2/2/2016    25.6
DG-3465 2/3/2016    36.4
DG-3465 2/4/2016    36.4
DG-3465 2/5/2016    36.4
DG-3465 2/6/2016    36.4
DG-3465 2/7/2016    36.4
DG-3465 2/8/2016    36.4
DG-3465 2/9/2016    36.4
DG-3465 2/10/2016   36.4
TY-9456 2/1/2016    0
TY-9456 2/2/2016    0
TY-9456 2/3/2016    5.23
TY-9456 2/4/2016    4.12
TY-9456 2/5/2016    5.95
TY-9456 2/6/2016    6.97
TY-9456 2/7/2016    12.45
TY-9456 2/8/2016    15.61
TY-9456 2/9/2016    15.61
TY-9456 2/10/2016   15.61

      

Tried a few different things, but I kind of got stuck in the fact that nothing came up in my head. What do you suggest? Using python pandas by the way;)

+3


source to share


3 answers


You can, roll

in the Score column, calculate the standard deviation and then discard the rows where the standard deviations are zero along with five rows in front of them (this assumes you want to remove rows with the same values ​​on consecutive days):

df.drop(np.unique(df.Score.rolling(5).std()[lambda x: x == 0].index.values - pd.np.arange(5)[:, None]))

      



enter image description here

+4


source


You are grouping shift and cumsum (). Edited to include @Scott Boston suggestion



df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5)


    PersonID    Date    Score
6   AB-123  2/7/2016    67.50
7   AB-123  2/8/2016    73.40
8   AB-123  2/9/2016    70.50
9   AB-123  2/10/2016   68.00
10  DG-3465 2/1/2016    22.50
11  DG-3465 2/2/2016    25.60
20  TY-9456 2/1/2016    0.00
21  TY-9456 2/2/2016    0.00
22  TY-9456 2/3/2016    5.23
23  TY-9456 2/4/2016    4.12
24  TY-9456 2/5/2016    5.95
25  TY-9456 2/6/2016    6.97
26  TY-9456 2/7/2016    12.45
27  TY-9456 2/8/2016    15.61
28  TY-9456 2/9/2016    15.61
29  TY-9456 2/10/2016   15.61

      

+3


source


You can exclude those lines that differ by 0 and are offset by 1 day:

In [11]: df[(df.Score.diff() != 0) | (df.Date.diff() != pd.offsets.Day().delta)]
Out[11]:
   PersonID       Date  Score
0    AB-123 2016-02-01   0.00
6    AB-123 2016-02-07  67.50
7    AB-123 2016-02-08  73.40
8    AB-123 2016-02-09  70.50
9    AB-123 2016-02-10  68.00
10  DG-3465 2016-02-01  22.50
11  DG-3465 2016-02-02  25.60
12  DG-3465 2016-02-03  36.40
20  TY-9456 2016-02-01   0.00
22  TY-9456 2016-02-03   5.23
23  TY-9456 2016-02-04   4.12
24  TY-9456 2016-02-05   5.95
25  TY-9456 2016-02-06   6.97
26  TY-9456 2016-02-07  12.45
27  TY-9456 2016-02-08  15.61

      

0


source







All Articles