Filtering redundant duplicate data from a pandas DataFrame
I have a dataframe that looks something like this:
index, othercols, FPN
ts1, otherStuff, val1
ts2, otherStuff, val2
ts3, otherStuff, val3
ts4, otherStuff, val4
....
tsn, otherStuff, valn
Due to the external data source, many of these values will be repeated, so in a million-row dataframe there will be multiple chunks, up to 10,000 seconds long, that all repeat the same value for consecutive timestamps. For my purposes this repetition is unnecessary, so I want to remove all duplicate rows except the first and last of each run, e.g.:
1, 0
2, 0
3, 0
4, 5
5, 0
6, 0
becomes
1, 0
3, 0
4, 5
5, 0
6, 0
I was able to do it, but it is slower than I would like (it takes 2 minutes for a single 60 MB file, mostly in the apply call below), and I think there must be a better way to do it.
Here's my brute-force solution. Is there a faster / more elegant way to do this?
import copy

data=df['FPN']
# Build copies of the column shifted up and down by one row
shft_up=(copy.deepcopy(data)).tolist()
shft_dn=(copy.deepcopy(data)).tolist()
del shft_up[0]
shft_up=shft_up+[None]
del shft_dn[-1]
shft_dn=[None]+shft_dn
df['shft_up']=shft_up
df['shft_dn']=shft_dn

def is_rep(row):
    # 1 if the row equals both of its neighbours, i.e. it lies inside a run
    if row['shft_dn']==row['FPN'] and row['shft_up']==row['FPN']:
        return 1
    else:
        return 0

# Row-wise apply: this is where most of the time goes
df['mask_col']=df.apply(lambda row:is_rep(row),axis=1)
df=(df[df['mask_col']==0]).drop(['shft_up','shft_dn','mask_col'],axis=1)
I think I have this logic correct. I add a new boolean column "run" that tells whether the value is the same as the previous row's value:
In [438]:
df['run'] = (df['val'] == df['val'].shift())
df
Out[438]:
   id  val    run
0   1    0  False
1   2    0   True
2   3    0   True
3   4    5  False
4   5    0  False
5   6    0   True
Then I filter out the rows where run is True and the next row's run is also True:
In [442]:
df[~((df['run']==True) & (df['run'].shift(-1) == True))]
Out[442]:
   id  val    run
0   1    0  False
2   3    0   True
3   4    5  False
4   5    0  False
5   6    0   True
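For reference, the two steps above can be combined into one self-contained sketch that does not add a helper column to the frame. The function name drop_run_interiors_via_flag is made up for illustration, and the sample frame is rebuilt from the example in the question; treat this as a sketch under those assumptions rather than a definitive implementation.

```python
import pandas as pd


def drop_run_interiors_via_flag(df, col):
    """Keep only the first and last row of each run of equal values in `col`.

    Hypothetical helper name; the logic mirrors the two steps shown above.
    """
    # True where the value equals the previous row's value
    run = df[col].eq(df[col].shift())
    # The same flag for the following row; the last row has no successor
    next_run = run.shift(-1, fill_value=False)
    # A row is strictly inside a run when both flags are True
    return df[~(run & next_run)]


# Sample data rebuilt from the example in the question
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "val": [0, 0, 0, 5, 0, 0]})
result = drop_run_interiors_via_flag(df, "val")
print(result)
```

Both comparisons are vectorised column operations, so nothing is evaluated row by row as in the apply-based version from the question.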
EDIT
The following one-liner also works, as confirmed by the OP:
In [447]:
df = df[(df['val'].shift()!=df['val'].shift(-1)) | (df['val']!=df['val'].shift(-1))]
df
Out[447]:
   id  val
0   1    0
2   3    0
3   4    5
4   5    0
5   6    0
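A runnable sketch of the one-liner wrapped in a function and applied to a larger synthetic column is given here as a sanity check. The function name drop_run_interiors and the synthetic data are assumptions made for illustration; only the column name FPN comes from the question. Note that NaN never compares equal to NaN, so rows with missing values are always kept by this mask.

```python
import numpy as np
import pandas as pd


def drop_run_interiors(df, col):
    """Drop rows whose value equals both the previous and the next row's value.

    Hypothetical helper name wrapping the one-liner above.
    """
    prev_val = df[col].shift()
    next_val = df[col].shift(-1)
    # Keep a row unless previous, current and next values are all equal
    keep = (prev_val != next_val) | (df[col] != next_val)
    return df[keep]


# Synthetic data (an assumption): 1000 runs of random length 1..49
rng = np.random.default_rng(0)
values = rng.integers(0, 5, size=1000)
lengths = rng.integers(1, 50, size=1000)
big = pd.DataFrame({"FPN": np.repeat(values, lengths)})

trimmed = drop_run_interiors(big, "FPN")

# After trimming, no value should appear three times in a row
v = trimmed["FPN"].to_numpy()
has_triple = bool(((v[1:-1] == v[:-2]) & (v[1:-1] == v[2:])).any())
print(len(big), len(trimmed), has_triple)
```

The first and last rows of the frame always survive, because their missing neighbour is NaN and therefore compares unequal.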