# Nearly Duplicate Lines Detection

Let's say I have a table with dates and a value for each date (plus other columns). I can find lines that have the same value on the same day using

``````data.duplicated(subset=["VALUE", "DAY"], keep=False)
```

```

Now tell me that I want the day to be off by 1 or 2, and the value to be off until 10, how do I do that?

Example:

``````DAY MTH YYY VALUE   NAME
22  9   2016    8.25    John
22  9   2016    43      John
6   11  2016    28.25   Mary
2   10  2016    50  George
23  11  2016    90  George
23  10  2016    30  Jenn
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
13  9   2016    170 Kathie
13  10  2016    170 Kathie
13  11  2016    160 Kathie
8   9   2016    16  Gina
9   10  2016    16  Gina
8   11  2016    16  Gina
16  11  2016    25  Ross
21  11  2016    45  Ross
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari
```

```

Using the above code, I can find:

``````DAY MTH YYY VALUE   NAME
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari
```

```

However, I would also like to find the 16 values ββfor Gina on Aug 8, Sep 9, and Oct 8, because they have the same value and, although not on the same day, it is just a holiday.

Likewise, I want to define values ββfor Sept 13, Oct 13, and Nov 13 for Kathie, because the value is only off by 10.

How can i do this?

+3

source to share

use `numpy`

and triangle indexing to display all combinations

``````day = df.DAY.values
val = df.VALUE.values

i, j = np.triu_indices(len(df), k=1)
c1 = np.abs(day[i] - day[j]) < 2
c2 = np.abs(val[i] - val[j]) < 10

c = c1 & c2
df.iloc[np.unique(np.append(i[c], j[c]))]

DAY  MTH   YYY  VALUE    NAME
1    22    9  2016   43.0    John
6    24    8  2016   10.0    Mike
7    24    9  2016   10.0    Mike
8    24   10  2016   10.0    Mike
9    24   11  2016   10.0    Mike
10   13    9  2016  170.0  Kathie
11   13   10  2016  170.0  Kathie
13    8    9  2016   16.0    Gina
14    9   10  2016   16.0    Gina
15    8   11  2016   16.0    Gina
17   21   11  2016   45.0    Ross
18   23    9  2016   50.0   Shari
19   23   10  2016   50.0   Shari
20   23   11  2016   50.0   Shari
```

```
+2

source

Hard forced:

``````    df_data = df_data.sort_values(['DAY','VALUE'])
df_data['Dup'] = False

prev_row = pd.Series()
prev_idx = None
for idx, row in df_data.iterrows():
if not prev_row.empty:
if (abs(row['DAY'] - prev_row['DAY']) <=2) & \
(abs(row['VALUE'] - prev_row['VALUE']) <=10):
df_data['Dup'][idx] = True
df_data['Dup'][prev_idx] = True
prev_row, prev_idx  = row, idx

print df_data
```

```

gives:

``````    DAY  MTH   YYY   VALUE    Dup
3     2   10  2016   50.00  False
2     6   11  2016   28.25  False
13    8    9  2016   16.00   True
15    8   11  2016   16.00   True
14    9   10  2016   16.00   True
12   13   11  2016  160.00   True
10   13    9  2016  170.00   True
11   13   10  2016  170.00   True
16   16   11  2016   25.00  False
17   21   11  2016   45.00  False
0    22    9  2016    8.25  False
1    22    9  2016   43.00  False
5    23   10  2016   30.00  False
18   23    9  2016   50.00   True
19   23   10  2016   50.00   True
20   23   11  2016   50.00   True
4    23   11  2016   90.00  False
6    24    8  2016   10.00   True
7    24    9  2016   10.00   True
8    24   10  2016   10.00   True
9    24   11  2016   10.00   True
```

```

Is this the desired result?

+2

source

All Articles