Nearly Duplicate Lines Detection

Question

Nearly Duplicate Lines Detection

Let's say I have a table with dates and a value for each date (plus other columns). I can find lines that have the same value on the same day using

data.duplicated(subset=["VALUE", "DAY"], keep=False)

Now tell me that I want the day to be off by 1 or 2, and the value to be off until 10, how do I do that?

Example:

DAY MTH YYY VALUE   NAME
22  9   2016    8.25    John
22  9   2016    43      John
6   11  2016    28.25   Mary
2   10  2016    50  George
23  11  2016    90  George
23  10  2016    30  Jenn
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
13  9   2016    170 Kathie
13  10  2016    170 Kathie
13  11  2016    160 Kathie
8   9   2016    16  Gina
9   10  2016    16  Gina
8   11  2016    16  Gina
16  11  2016    25  Ross
21  11  2016    45  Ross
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari

Using the above code, I can find:

DAY MTH YYY VALUE   NAME
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari

However, I would also like to find the 16 values for Gina on Aug 8, Sep 9, and Oct 8, because they have the same value and, although not on the same day, it is just a holiday.

Likewise, I want to define values for Sept 13, Oct 13, and Nov 13 for Kathie, because the value is only off by 10.

How can i do this?

+3

python pandas duplicates

user 22 Mar 17 at 17:25

source to share

2 answers

Hard forced:

    df_data = df_data.sort_values(['DAY','VALUE'])
    df_data['Dup'] = False

    prev_row = pd.Series()
    prev_idx = None
    for idx, row in df_data.iterrows():
        if not prev_row.empty:
            if (abs(row['DAY'] - prev_row['DAY']) <=2) & \
               (abs(row['VALUE'] - prev_row['VALUE']) <=10):
                df_data['Dup'][idx] = True
                df_data['Dup'][prev_idx] = True
        prev_row, prev_idx  = row, idx

    print df_data

gives:

    DAY  MTH   YYY   VALUE    Dup
3     2   10  2016   50.00  False
2     6   11  2016   28.25  False
13    8    9  2016   16.00   True
15    8   11  2016   16.00   True
14    9   10  2016   16.00   True
12   13   11  2016  160.00   True
10   13    9  2016  170.00   True
11   13   10  2016  170.00   True
16   16   11  2016   25.00  False
17   21   11  2016   45.00  False
0    22    9  2016    8.25  False
1    22    9  2016   43.00  False
5    23   10  2016   30.00  False
18   23    9  2016   50.00   True
19   23   10  2016   50.00   True
20   23   11  2016   50.00   True
4    23   11  2016   90.00  False
6    24    8  2016   10.00   True
7    24    9  2016   10.00   True
8    24   10  2016   10.00   True
9    24   11  2016   10.00   True

Is this the desired result?

+2

B. Shieh 22 Mar 17 at 18:21

source to share

piRSquared · Accepted Answer · 2017-03-22T19:18:20+0000

use numpy

and triangle indexing to display all combinations

day = df.DAY.values
val = df.VALUE.values

i, j = np.triu_indices(len(df), k=1)
c1 = np.abs(day[i] - day[j]) < 2
c2 = np.abs(val[i] - val[j]) < 10

c = c1 & c2
df.iloc[np.unique(np.append(i[c], j[c]))]

    DAY  MTH   YYY  VALUE    NAME
1    22    9  2016   43.0    John
6    24    8  2016   10.0    Mike
7    24    9  2016   10.0    Mike
8    24   10  2016   10.0    Mike
9    24   11  2016   10.0    Mike
10   13    9  2016  170.0  Kathie
11   13   10  2016  170.0  Kathie
13    8    9  2016   16.0    Gina
14    9   10  2016   16.0    Gina
15    8   11  2016   16.0    Gina
17   21   11  2016   45.0    Ross
18   23    9  2016   50.0   Shari
19   23   10  2016   50.0   Shari
20   23   11  2016   50.0   Shari

Nearly Duplicate Lines Detection

More articles: