Filtering redundant duplicate data from a pandas DataFrame

I have a dataframe that looks something like this:

index, othercols, FPN
ts1, otherStuff, val1
ts2, otherStuff, val2
ts3, otherStuff, val3
ts4, otherStuff, val4
....
tsn, otherStuff, valn

      

Due to the external data source, many of these values will be repeated, so a dataframe about a million rows long will contain multiple chunks, up to 10,000 rows long, that all repeat the same data for consecutive timestamps. For my purposes this repetition is not needed, so I want to remove all the repeated rows apart from the first and last row of each section, e.g.:

1, 0
2, 0
3, 0
4, 5
5, 0
6, 0

      

becomes

1, 0 
3, 0
4, 5
5, 0
6, 0

      

I was able to do it, but it is slower than I would like (it takes 2 minutes for a single 60MB file, mostly in the apply call shown below), and I think there must be a better way to do it.

Here's my brute-force solution; is there a quicker / more efficient way to do this?

import copy

data=df['FPN']

# Build copies of the column shifted one row up and one row down
shft_up=(copy.deepcopy(data)).tolist()
shft_dn=(copy.deepcopy(data)).tolist()

del shft_up[0]
shft_up=shft_up+[None]

del shft_dn[-1]
shft_dn=[None]+shft_dn

df['shft_up']=shft_up
df['shft_dn']=shft_dn

# A row is redundant when both of its neighbours hold the same FPN value
def is_rep(row):
    if row['shft_dn']==row['FPN'] and row['shft_up']==row['FPN']:
        return 1
    else:
        return 0

# Row-wise apply is the slow step (the old reduce= argument was removed in later pandas releases)
df['mask_col']=df.apply(is_rep,axis=1)

df=(df[df['mask_col']==0]).drop(['shft_up','shft_dn','mask_col'],axis=1)

      



1 answer


I think I have the logic right: I add a new boolean column, "run", that says whether each value is the same as the value in the previous row:

In [438]:

df['run'] = (df['val'] == df['val'].shift())
df
Out[438]:
   id  val    run
0   1    0  False
1   2    0   True
2   3    0   True
3   4    5  False
4   5    0  False
5   6    0   True
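
In case shift() is unfamiliar, here is a small sketch of what that comparison does on the sample values. The variable names (val, shifted, run) are only illustrative. shift() moves every value down one row and leaves NaN in the first slot, which is why the first row always compares as False.

```python
import pandas as pd

val = pd.Series([0, 0, 0, 5, 0, 0])

# Each row now holds the value of the row above it; row 0 becomes NaN
shifted = val.shift()

# Element-wise comparison; NaN never equals anything, so row 0 is False
run = val == shifted

print(shifted.tolist())
print(run.tolist())
```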

      

Then I filter out the rows where "run" is True and the next row's "run" is also True:

In [442]:

df[~((df['run']==True) & (df['run'].shift(-1) == True))]
Out[442]:
   id  val    run
0   1    0  False
2   3    0   True
3   4    5  False
4   5    0  False
5   6    0   True
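
For anyone who wants to run the two steps outside an IPython session, here is a minimal self-contained sketch on the sample data from the question. It keeps the mask in local variables rather than adding a "run" column to the frame; the names redundant and filtered are my own, and fill_value=False is just one way to treat the last row as having no follower.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "val": [0, 0, 0, 5, 0, 0]})

# True where a row repeats the value of the row above it
run = df["val"] == df["val"].shift()

# A row is redundant when it repeats the row above AND the row below repeats it
redundant = run & run.shift(-1, fill_value=False)

# Keep everything except the interior rows of each run
filtered = df[~redundant]
print(filtered["id"].tolist())  # [1, 3, 4, 5, 6]
```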

      



EDIT

The following one-liner should also work, pending the OP's confirmation:

In [447]:

df = df[(df['val'].shift()!=df['val'].shift(-1)) | (df['val']!=df['val'].shift(-1))]
df
Out[447]:
   id  val
0   1    0
2   3    0
3   4    5
4   5    0
5   6    0
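
As a rough sketch of how the one-liner might be applied to the frame in the question, here it is wrapped in a small helper. The function name drop_run_interiors, the timestamp index, and the othercols values are made up for illustration; only the FPN column name comes from the question. One caveat: NaN compares as not equal to NaN, so runs of missing values are kept in full.

```python
import pandas as pd

def drop_run_interiors(df, col):
    """Keep the first and last row of every run of equal values in `col`."""
    s = df[col]
    prev_val = s.shift()     # value of the row above
    next_val = s.shift(-1)   # value of the row below
    # A row is dropped only when previous == current == next
    keep = (prev_val != next_val) | (s != next_val)
    return df[keep]

# Hypothetical frame shaped like the one in the question
frame = pd.DataFrame(
    {"othercols": list("abcdef"), "FPN": [0, 0, 0, 5, 0, 0]},
    index=pd.date_range("2024-01-01", periods=6, freq="min"),
)

result = drop_run_interiors(frame, "FPN")
print(result["FPN"].tolist())  # [0, 0, 5, 0, 0]
```

Because everything is done with whole-column comparisons, there is no row-wise apply, which is where the original approach spent most of its time.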

      
