Replacing more than n consecutive values in a Pandas DataFrame column

Question

Replacing more than n consecutive values in a Pandas DataFrame column

Suppose I have the following DataFrame df

df = pd.DataFrame({"a" : [1,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], "b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], "c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

And I want to replace number 4, which repeats more than 10 times in a row, in any column (there could be hundreds of columns), with 10 4 and the rest 5.

So, for example, 12 consecutive 4s will be replaced by ten 4s and two 5s.

How can I achieve this using Pandas?

I would love to apply a lambda, but I don't know how to look back enough lines and it must start from the end and move forward, otherwise it will break the sequence of values. Each search will have to look at the previous 10 lines to see if they are all 4, and if so, set the current value to 5.

Not sure how to do it though!

+3

python pandas replace multiple-columns cumsum

Chris 22 Mar 17 at 8:25

source to share

3 answers

Remove all 4s, fill in 4s with limit=10

as argument, and remove the remaining NA from 5s. I find this method more explicit and reflecting more of your intent:

df[df!=4].fillna(4, limit=10).fillna(5)

Cast the df back to integers with astype(int)

at the end if necessary, as an NA intrusion will cause the dataframe to be floated.

+2

Boud 22 Mar 17 at 9:28 am

source to share

This should do the trick:

import pandas as pd

df = pd.DataFrame({"a" : [1,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], "b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], "c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

def replacer(l,target_val=4,replace_val=5,repeat_max=10):
    counter = 0
    new_l = []
    for e in l:
        if e == target_val: counter += 1
        else:
            counter = 0

        if counter > repeat_max:
            new_l.append(replace_val)
        else:
            new_l.append(e)

    return new_l

df1 = df.apply(replacer)

Output:

    a  b  c
0   1  3  4
1   2  3  4
2   2  3  4
3   2  3  4
4   2  3  4
5   2  3  4
6   2  3  4
7   2  4  4
8   2  4  4
9   3  4  4
10  3  4  5
11  4  5  5
12  4  5  5
13  4  5  5
14  4  5  5
15  4  5  5
16  4  5  5
17  4  5  5
18  4  5  5
19  4  5  5
20  4  5  5
21  5  5  1
22  5  5  2
23  5  5  2
24  5  5  2
25  5  5  2
26  5  5  2
27  5  5  2
28  5  6  2
29  5  6  2
30  5  7  3
31  5  7  3

+1

Alex fung 22 Mar 17 at 8:57

source to share

jezrael · Accepted Answer · 2017-03-22T09:04:00+0000

You can use:

#column a is changed for 2 groups of 4
df = pd.DataFrame({
"a" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,7,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], 
"b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], 
"c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

The solution calculates the sequence 4 with the reset, if NaN

created where

, and then applies boolean mask

to the original df

replacement 4

on 5

to mask

:

a = df == 4
mask = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0) > 10
df1 = df.mask(mask, 5)

print (df1)
    a  b  c
0   4  3  4
1   4  3  4
2   4  3  4
3   4  3  4
4   4  3  4
5   4  3  4
6   4  3  4
7   4  4  4
8   4  4  4
9   4  4  4
10  5  4  5
11  5  5  5
12  5  5  5
13  5  5  5
14  7  5  5
15  4  5  5
16  4  5  5
17  4  5  5
18  4  5  5
19  4  5  5
20  4  5  5
21  4  5  1
22  4  5  2
23  4  5  2
24  4  5  2
25  5  5  2
26  5  5  2
27  5  5  2
28  5  6  2
29  5  6  2
30  5  7  3
31  5  7  3

For better control, you can use concat

:

print (pd.concat([df, df1], axis=1, keys=['orig','new']))
   orig       new      
      a  b  c   a  b  c
0     4  3  4   4  3  4
1     4  3  4   4  3  4
2     4  3  4   4  3  4
3     4  3  4   4  3  4
4     4  3  4   4  3  4
5     4  3  4   4  3  4
6     4  3  4   4  3  4
7     4  4  4   4  4  4
8     4  4  4   4  4  4
9     4  4  4   4  4  4
10    4  4  4   5  4  5
11    4  5  4   5  5  5
12    4  5  4   5  5  5
13    4  5  4   5  5  5
14    7  5  4   7  5  5
15    4  5  4   4  5  5
16    4  5  4   4  5  5
17    4  5  4   4  5  5
18    4  5  5   4  5  5
19    4  5  5   4  5  5
20    4  5  5   4  5  5
21    4  5  1   4  5  1
22    4  5  2   4  5  2
23    4  5  2   4  5  2
24    4  5  2   4  5  2
25    4  5  2   5  5  2
26    4  5  2   5  5  2
27    4  5  2   5  5  2
28    4  6  2   5  6  2
29    5  6  2   5  6  2
30    5  7  3   5  7  3
31    5  7  3   5  7  3

Replacing more than n consecutive values ​​in a Pandas DataFrame column

More articles:

Replacing more than n consecutive values in a Pandas DataFrame column