Detecting Nearly Duplicate Rows

Let's say I have a table with dates and a value for each date (plus other columns). I can find rows that have the same value on the same day using

data.duplicated(subset=["VALUE", "DAY"], keep=False)
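For reference, a minimal sketch of what that call returns, using a tiny made-up frame (the four rows below are illustrative, not the full table from the question). With keep=False, every member of a duplicate group is flagged, not just the later occurrences.

```python
import pandas as pd

# Illustrative data only: two exact (VALUE, DAY) duplicates and two near misses
data = pd.DataFrame({
    "DAY": [24, 24, 8, 9],
    "VALUE": [10, 10, 16, 16],
    "NAME": ["Mike", "Mike", "Gina", "Gina"],
})

# Boolean mask: True for every row whose (VALUE, DAY) pair occurs more than once
mask = data.duplicated(subset=["VALUE", "DAY"], keep=False)
exact = data[mask]
print(exact)
```

The Gina rows are not flagged because their days differ by one, which is exactly the case the question is about.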

      

Now suppose I want to allow the day to be off by 1 or 2 and the value to be off by up to 10. How do I do that?

Example:

DAY MTH YYY VALUE   NAME
22  9   2016    8.25    John
22  9   2016    43      John
6   11  2016    28.25   Mary
2   10  2016    50  George
23  11  2016    90  George
23  10  2016    30  Jenn
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
13  9   2016    170 Kathie
13  10  2016    170 Kathie
13  11  2016    160 Kathie
8   9   2016    16  Gina
9   10  2016    16  Gina
8   11  2016    16  Gina
16  11  2016    25  Ross
21  11  2016    45  Ross
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari

      

Using the above code, I can find:

DAY MTH YYY VALUE   NAME
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari

      

However, I would also like to find the value 16 for Gina on Sep 8, Oct 9, and Nov 8, because the rows have the same value and, although they are not on the same day, the one-day shift is just due to a holiday.

Likewise, I want to find the values for Sep 13, Oct 13, and Nov 13 for Kathie, because the value is only off by 10.

How can I do this?

2 answers


Use numpy and upper-triangle indexing to enumerate all pairs of rows:



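import numpy as np

day = df.DAY.values
val = df.VALUE.values

# i, j index every unordered pair of rows (upper triangle, excluding the diagonal)
i, j = np.triu_indices(len(df), k=1)
# Strict comparisons: days at most 1 apart, values less than 10 apart.
# Use <= 2 and <= 10 to allow a 2-day gap or a difference of exactly 10.
c1 = np.abs(day[i] - day[j]) < 2
c2 = np.abs(val[i] - val[j]) < 10

c = c1 & c2
# Keep every row that takes part in at least one matching pair
df.iloc[np.unique(np.append(i[c], j[c]))]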

    DAY  MTH   YYY  VALUE    NAME
1    22    9  2016   43.0    John
6    24    8  2016   10.0    Mike
7    24    9  2016   10.0    Mike
8    24   10  2016   10.0    Mike
9    24   11  2016   10.0    Mike
10   13    9  2016  170.0  Kathie
11   13   10  2016  170.0  Kathie
13    8    9  2016   16.0    Gina
14    9   10  2016   16.0    Gina
15    8   11  2016   16.0    Gina
17   21   11  2016   45.0    Ross
18   23    9  2016   50.0   Shari
19   23   10  2016   50.0   Shari
20   23   11  2016   50.0   Shari
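Two things stand out in that output: John's 43 is included because it pairs with Ross's 45 (the comparison ignores NAME), and Kathie's 160 is missing because the value test is strict. A possible variant, sketched under the assumptions that matches should stay within one NAME and that the tolerances are inclusive (day off by up to 2, value off by up to 10), is shown below. The function name near_duplicates and its parameters are made up for this illustration, and the sample frame is a subset of the question's rows.

```python
import numpy as np
import pandas as pd

def near_duplicates(df, day_tol=2, val_tol=10, same_name=True):
    """Return rows that have at least one partner within both tolerances."""
    day = df["DAY"].to_numpy()
    val = df["VALUE"].to_numpy()
    # All unordered row pairs (i < j)
    i, j = np.triu_indices(len(df), k=1)
    close = (np.abs(day[i] - day[j]) <= day_tol) & (np.abs(val[i] - val[j]) <= val_tol)
    if same_name:
        # Assumption: only rows belonging to the same NAME count as duplicates
        name = df["NAME"].to_numpy()
        close = close & (name[i] == name[j])
    hits = np.union1d(i[close], j[close])
    return df.iloc[hits]

# Subset of the question's rows
df = pd.DataFrame({
    "DAY": [8, 9, 8, 13, 13, 13, 22, 21],
    "MTH": [9, 10, 11, 9, 10, 11, 9, 11],
    "YYY": [2016] * 8,
    "VALUE": [16, 16, 16, 170, 170, 160, 43, 45],
    "NAME": ["Gina", "Gina", "Gina", "Kathie", "Kathie", "Kathie", "John", "Ross"],
})

result = near_duplicates(df)
print(result)
```

Note that this still compares day-of-month numbers only, so a pair such as the 31st and the 1st would not be treated as close. The pairwise arrays also grow quadratically with the number of rows.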

      

Brute force:

    df_data = df_data.sort_values(['DAY', 'VALUE'])
    df_data['Dup'] = False

    # Compare each row with the previous row in sorted order
    prev_row = None
    prev_idx = None
    for idx, row in df_data.iterrows():
        if prev_row is not None:
            if (abs(row['DAY'] - prev_row['DAY']) <= 2) and \
               (abs(row['VALUE'] - prev_row['VALUE']) <= 10):
                # .loc avoids chained assignment, which may not write to df_data
                df_data.loc[idx, 'Dup'] = True
                df_data.loc[prev_idx, 'Dup'] = True
        prev_row, prev_idx = row, idx

    print(df_data)

      

gives:



    DAY  MTH   YYY   VALUE    Dup
3     2   10  2016   50.00  False
2     6   11  2016   28.25  False
13    8    9  2016   16.00   True
15    8   11  2016   16.00   True
14    9   10  2016   16.00   True
12   13   11  2016  160.00   True
10   13    9  2016  170.00   True
11   13   10  2016  170.00   True
16   16   11  2016   25.00  False
17   21   11  2016   45.00  False
0    22    9  2016    8.25  False
1    22    9  2016   43.00  False
5    23   10  2016   30.00  False
18   23    9  2016   50.00   True
19   23   10  2016   50.00   True
20   23   11  2016   50.00   True
4    23   11  2016   90.00  False
6    24    8  2016   10.00   True
7    24    9  2016   10.00   True
8    24   10  2016   10.00   True
9    24   11  2016   10.00   True

      

Is this the desired result?
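The loop above only ever compares a row with its immediate neighbour in the sorted order, so the same check can be sketched without iterrows by using diff and shift. This is an equivalent rewrite of that adjacent-row comparison, not a full pairwise search, and the small frame is illustrative data taken from a few of the question's rows.

```python
import pandas as pd

# Illustrative rows only
df_data = pd.DataFrame({
    "DAY": [2, 8, 8, 9, 13, 13],
    "VALUE": [50, 16, 16, 16, 160, 170],
})

s = df_data.sort_values(["DAY", "VALUE"])

# True where a row is within tolerance of the row just before it
# (the first row's diff is NaN, which compares as False)
close_prev = (s["DAY"].diff().abs() <= 2) & (s["VALUE"].diff().abs() <= 10)

# A row is flagged if it matches its predecessor or its successor
s["Dup"] = close_prev | close_prev.shift(-1, fill_value=False)
print(s)
```

Because only neighbours are compared, two near-duplicate rows separated by an unrelated row in the sort order would be missed by both this version and the loop.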
