Nearly Duplicate Lines Detection
Let's say I have a table with dates and a value for each date (plus other columns). I can find lines that have the same value on the same day using
data.duplicated(subset=["VALUE", "DAY"], keep=False)
Now tell me that I want the day to be off by 1 or 2, and the value to be off until 10, how do I do that?
Example:
DAY MTH YYY VALUE NAME
22 9 2016 8.25 John
22 9 2016 43 John
6 11 2016 28.25 Mary
2 10 2016 50 George
23 11 2016 90 George
23 10 2016 30 Jenn
24 8 2016 10 Mike
24 9 2016 10 Mike
24 10 2016 10 Mike
24 11 2016 10 Mike
13 9 2016 170 Kathie
13 10 2016 170 Kathie
13 11 2016 160 Kathie
8 9 2016 16 Gina
9 10 2016 16 Gina
8 11 2016 16 Gina
16 11 2016 25 Ross
21 11 2016 45 Ross
23 9 2016 50 Shari
23 10 2016 50 Shari
23 11 2016 50 Shari
Using the above code, I can find:
DAY MTH YYY VALUE NAME
24 8 2016 10 Mike
24 9 2016 10 Mike
24 10 2016 10 Mike
24 11 2016 10 Mike
23 9 2016 50 Shari
23 10 2016 50 Shari
23 11 2016 50 Shari
However, I would also like to find the 16 values ββfor Gina on Aug 8, Sep 9, and Oct 8, because they have the same value and, although not on the same day, it is just a holiday.
Likewise, I want to define values ββfor Sept 13, Oct 13, and Nov 13 for Kathie, because the value is only off by 10.
How can i do this?
source to share
use numpy
and triangle indexing to display all combinations
day = df.DAY.values
val = df.VALUE.values
i, j = np.triu_indices(len(df), k=1)
c1 = np.abs(day[i] - day[j]) < 2
c2 = np.abs(val[i] - val[j]) < 10
c = c1 & c2
df.iloc[np.unique(np.append(i[c], j[c]))]
DAY MTH YYY VALUE NAME
1 22 9 2016 43.0 John
6 24 8 2016 10.0 Mike
7 24 9 2016 10.0 Mike
8 24 10 2016 10.0 Mike
9 24 11 2016 10.0 Mike
10 13 9 2016 170.0 Kathie
11 13 10 2016 170.0 Kathie
13 8 9 2016 16.0 Gina
14 9 10 2016 16.0 Gina
15 8 11 2016 16.0 Gina
17 21 11 2016 45.0 Ross
18 23 9 2016 50.0 Shari
19 23 10 2016 50.0 Shari
20 23 11 2016 50.0 Shari
source to share
Hard forced:
df_data = df_data.sort_values(['DAY','VALUE'])
df_data['Dup'] = False
prev_row = pd.Series()
prev_idx = None
for idx, row in df_data.iterrows():
if not prev_row.empty:
if (abs(row['DAY'] - prev_row['DAY']) <=2) & \
(abs(row['VALUE'] - prev_row['VALUE']) <=10):
df_data['Dup'][idx] = True
df_data['Dup'][prev_idx] = True
prev_row, prev_idx = row, idx
print df_data
gives:
DAY MTH YYY VALUE Dup
3 2 10 2016 50.00 False
2 6 11 2016 28.25 False
13 8 9 2016 16.00 True
15 8 11 2016 16.00 True
14 9 10 2016 16.00 True
12 13 11 2016 160.00 True
10 13 9 2016 170.00 True
11 13 10 2016 170.00 True
16 16 11 2016 25.00 False
17 21 11 2016 45.00 False
0 22 9 2016 8.25 False
1 22 9 2016 43.00 False
5 23 10 2016 30.00 False
18 23 9 2016 50.00 True
19 23 10 2016 50.00 True
20 23 11 2016 50.00 True
4 23 11 2016 90.00 False
6 24 8 2016 10.00 True
7 24 9 2016 10.00 True
8 24 10 2016 10.00 True
9 24 11 2016 10.00 True
Is this the desired result?
source to share