Pandas delete rows if value is the same for a specific date range

Question

I tried to find something similar but couldn't find it. I have the data structure below, and I want to delete rows where the score stays the same for 5 days or more. In the example below, the entries for PersonID AB-123 from 2/1 to 2/6 should be deleted, as should those for DG-3465 from 2/3 to 2/10, but nothing for TY-9456. I thought of using rolling(), but that would only remove 2/1-2/5 for AB-123, not 2/6.

PersonID    Date    Score
AB-123  2/1/2016    0
AB-123  2/2/2016    0
AB-123  2/3/2016    0
AB-123  2/4/2016    0
AB-123  2/5/2016    0
AB-123  2/6/2016    0
AB-123  2/7/2016    67.5
AB-123  2/8/2016    73.4
AB-123  2/9/2016    70.5
AB-123  2/10/2016   68
DG-3465 2/1/2016    22.5
DG-3465 2/2/2016    25.6
DG-3465 2/3/2016    36.4
DG-3465 2/4/2016    36.4
DG-3465 2/5/2016    36.4
DG-3465 2/6/2016    36.4
DG-3465 2/7/2016    36.4
DG-3465 2/8/2016    36.4
DG-3465 2/9/2016    36.4
DG-3465 2/10/2016   36.4
TY-9456 2/1/2016    0
TY-9456 2/2/2016    0
TY-9456 2/3/2016    5.23
TY-9456 2/4/2016    4.12
TY-9456 2/5/2016    5.95
TY-9456 2/6/2016    6.97
TY-9456 2/7/2016    12.45
TY-9456 2/8/2016    15.61
TY-9456 2/9/2016    15.61
TY-9456 2/10/2016   15.61

I tried a few different things but got stuck and couldn't come up with anything else. What do you suggest? I'm using Python pandas, by the way ;)

+3

python pandas dataframe delete-row

PyRaider May 16 '17 at 19:23

3 answers

You can group by PersonID together with a run identifier built from shift() and cumsum(), then filter out the runs of 5 or more rows. Edited to include @Scott Boston's suggestion.

df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5)


    PersonID    Date    Score
6   AB-123  2/7/2016    67.50
7   AB-123  2/8/2016    73.40
8   AB-123  2/9/2016    70.50
9   AB-123  2/10/2016   68.00
10  DG-3465 2/1/2016    22.50
11  DG-3465 2/2/2016    25.60
20  TY-9456 2/1/2016    0.00
21  TY-9456 2/2/2016    0.00
22  TY-9456 2/3/2016    5.23
23  TY-9456 2/4/2016    4.12
24  TY-9456 2/5/2016    5.95
25  TY-9456 2/6/2016    6.97
26  TY-9456 2/7/2016    12.45
27  TY-9456 2/8/2016    15.61
28  TY-9456 2/9/2016    15.61
29  TY-9456 2/10/2016   15.61
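
To make the shift/cumsum step easier to follow, here is a minimal sketch that rebuilds the sample data from the question and labels each run of equal scores before filtering. The names `run_id`, `run_len` and `kept` are illustrative and not part of the answer above, and using `transform("size")` with a boolean mask instead of `filter` is just one possible variation, assuming the rows are already sorted by person and date.

```python
import pandas as pd

# Sample data from the question, one row per person per day.
scores = {
    "AB-123": [0, 0, 0, 0, 0, 0, 67.5, 73.4, 70.5, 68],
    "DG-3465": [22.5, 25.6] + [36.4] * 8,
    "TY-9456": [0, 0, 5.23, 4.12, 5.95, 6.97, 12.45, 15.61, 15.61, 15.61],
}
dates = pd.date_range("2016-02-01", periods=10, freq="D")
df = pd.DataFrame(
    [(pid, d, s) for pid, vals in scores.items() for d, s in zip(dates, vals)],
    columns=["PersonID", "Date", "Score"],
)

# A new run starts wherever Score differs from the previous row;
# the cumulative sum of those "changed" flags gives each run its own id.
run_id = (df["Score"] != df["Score"].shift()).cumsum().rename("run_id")

# Length of the run each row belongs to, counted separately per person.
run_len = df.groupby(["PersonID", run_id])["Score"].transform("size")

# Keep only rows whose run is shorter than 5 days.
kept = df[run_len < 5]
print(kept)
```

Grouping on PersonID as well as the run id keeps a run from continuing across two people when one person's last score happens to equal the next person's first.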

+3

Vaishali May 16 '17 at 19:35

You can exclude the rows whose Score is unchanged from the previous row and whose Date is exactly one day after it:

In [11]: df[(df.Score.diff() != 0) | (df.Date.diff() != pd.Timedelta(days=1))]
Out[11]:
   PersonID       Date  Score
0    AB-123 2016-02-01   0.00
6    AB-123 2016-02-07  67.50
7    AB-123 2016-02-08  73.40
8    AB-123 2016-02-09  70.50
9    AB-123 2016-02-10  68.00
10  DG-3465 2016-02-01  22.50
11  DG-3465 2016-02-02  25.60
12  DG-3465 2016-02-03  36.40
20  TY-9456 2016-02-01   0.00
22  TY-9456 2016-02-03   5.23
23  TY-9456 2016-02-04   4.12
24  TY-9456 2016-02-05   5.95
25  TY-9456 2016-02-06   6.97
26  TY-9456 2016-02-07  12.45
27  TY-9456 2016-02-08  15.61

0

Andy Hayden May 16 '17 at 19:28

Psidom · Accepted Answer · 2017-05-16T19:43:50+0000

You can use rolling on the Score column to calculate the standard deviation, and then drop the rows where the standard deviation is zero along with the four rows before each of them (this assumes you want to remove rows with the same values on consecutive days):

df.drop(np.unique(df.Score.rolling(5).std()[lambda x: x == 0].index.values - np.arange(5)[:, None]))
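
The index arithmetic in that one-liner can be unpacked as follows. This is only a sketch under a few assumptions: it uses the question's sample scores with a default integer index, it applies the window per PersonID so that no window spans two people, and it swaps the `std() == 0` test for a comparison of the rolling maximum and minimum, which avoids depending on a floating-point standard deviation being exactly zero. The names `WINDOW`, `ends` and `to_drop` are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample scores from the question (Date omitted; rows assumed sorted by day).
scores = {
    "AB-123": [0, 0, 0, 0, 0, 0, 67.5, 73.4, 70.5, 68],
    "DG-3465": [22.5, 25.6] + [36.4] * 8,
    "TY-9456": [0, 0, 5.23, 4.12, 5.95, 6.97, 12.45, 15.61, 15.61, 15.61],
}
df = pd.DataFrame(
    [(pid, s) for pid, vals in scores.items() for s in vals],
    columns=["PersonID", "Score"],
)

WINDOW = 5  # hypothetical name for the "5 days or more" threshold

# Rolling max and min within each person; incomplete windows give NaN.
by_person = df.groupby("PersonID")["Score"]
roll_max = by_person.transform(lambda s: s.rolling(WINDOW).max())
roll_min = by_person.transform(lambda s: s.rolling(WINDOW).min())

# Positions where a full window ends and all of its values are identical
# (NaN == NaN is False, so incomplete windows are never selected).
ends = np.flatnonzero((roll_max == roll_min).to_numpy())

# Each window end marks itself plus the WINDOW - 1 rows before it;
# broadcasting builds all those positions and np.unique removes overlaps.
to_drop = np.unique(ends[:, None] - np.arange(WINDOW))

result = df.drop(df.index[to_drop])
print(result)
```

Because overlapping windows each contribute their own five positions, a run of six identical days is removed in full, which addresses the concern in the question that a rolling approach would miss 2/6.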
