SQL dedupe based on the number of matching columns

Question

I am trying to find information on how to dedupe a table based on the number of matching columns between records.

Let's say my data source looks like this:

---------------------------------------------------
| ColumnA | ColumnB | ColumnC | ColumnD | ColumnN |
---------------------------------------------------
| Peter   | Dink    | Midget  | NULL    | 0738455 |
| Peter   | Dink    | Child   | 334AA   | 49595   |
| Mark    | Walhg   | Funky   | 334AA   | 0738455 | 
| Mark    | Dink    | NULL    | NULL    | NULL    |
| Mark    | Walhg   | Funky   | 334AA   | NULL    |
| Peter   | Dink    | NULL    | NULL    | 0738455 |
---------------------------------------------------

Basically, I want to be able to count the number of records that match on 2, 3, 4, etc. columns of data; however, this needs to be limited to a selected subset of the columns (and must ignore NULLs / blanks).

From the above data, I would like to say:

There are 0 records that match on 5 columns
There is 1 record that matches on 4 columns (3,5)
There is 1 record that matches on 3 columns (1,6) (3,5)
There are 2 records that match on 2 columns (1,6) (2,6) (3,5) (1,2)
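
For what it's worth, counts like these can be produced with a self-join that scores every pair of rows. This is only a sketch under assumptions: the table variable `@Src`, the explicit `ID` column and the CTE names `Clean` and `Pairs` are made up for illustration, and all five columns are stored as text. Blanks are turned into NULL first, and because `=` is never true when either side is NULL, NULLs and blanks never add to the score.

```sql
-- Hypothetical copy of the sample data; the ID column is an assumption.
DECLARE @Src TABLE (
    ID      INT PRIMARY KEY,
    ColumnA NVARCHAR(100),
    ColumnB NVARCHAR(100),
    ColumnC NVARCHAR(100),
    ColumnD NVARCHAR(100),
    ColumnN NVARCHAR(100)
);

INSERT INTO @Src (ID, ColumnA, ColumnB, ColumnC, ColumnD, ColumnN) VALUES
(1, 'Peter', 'Dink',  'Midget', NULL,    '0738455'),
(2, 'Peter', 'Dink',  'Child',  '334AA', '49595'),
(3, 'Mark',  'Walhg', 'Funky',  '334AA', '0738455'),
(4, 'Mark',  'Dink',  NULL,     NULL,    NULL),
(5, 'Mark',  'Walhg', 'Funky',  '334AA', NULL),
(6, 'Peter', 'Dink',  NULL,     NULL,    '0738455');

WITH Clean AS (
    -- Treat empty / whitespace-only values as NULL so they never match.
    SELECT ID,
           NULLIF(LTRIM(RTRIM(ColumnA)), '') AS A,
           NULLIF(LTRIM(RTRIM(ColumnB)), '') AS B,
           NULLIF(LTRIM(RTRIM(ColumnC)), '') AS C,
           NULLIF(LTRIM(RTRIM(ColumnD)), '') AS D,
           NULLIF(LTRIM(RTRIM(ColumnN)), '') AS N
    FROM @Src
),
Pairs AS (
    -- Score each unordered pair of rows once (x.ID < y.ID).
    SELECT x.ID AS ID1,
           y.ID AS ID2,
             CASE WHEN x.A = y.A THEN 1 ELSE 0 END
           + CASE WHEN x.B = y.B THEN 1 ELSE 0 END
           + CASE WHEN x.C = y.C THEN 1 ELSE 0 END
           + CASE WHEN x.D = y.D THEN 1 ELSE 0 END
           + CASE WHEN x.N = y.N THEN 1 ELSE 0 END AS MatchCount
    FROM Clean AS x
    INNER JOIN Clean AS y ON x.ID < y.ID
)
-- Number of pairs for each exact number of matching columns.
SELECT MatchCount, COUNT(*) AS PairCount
FROM Pairs
WHERE MatchCount >= 2
GROUP BY MatchCount
ORDER BY MatchCount DESC;
```

On the sample rows this returns one pair with 4 matching columns (3,5), one pair with 3 (1,6) and two pairs with 2 (1,2 and 2,6). No pair matches on all 5 columns, so that row is simply absent from the result. Selecting `ID1, ID2, MatchCount` from `Pairs` instead of aggregating lists the individual pairs.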

I also need it to "cascade" downward as the number of matching columns gets smaller. So with the data above, the data is unchanged after checking for 5-column matches. After the 4-column check, the data is reduced to:

---------------------------------------------------
| ColumnA | ColumnB | ColumnC | ColumnD | ColumnN |
---------------------------------------------------
| Peter   | Dink    | Midget  | NULL    | 0738455 |
| Peter   | Dink    | Child   | 334AA   | 49595   |
| Mark    | Walhg   | Funky   | 334AA   | 0738455 | 
| Mark    | Dink    | NULL    | NULL    | NULL    |
| Peter   | Dink    | NULL    | NULL    | 0738455 |
---------------------------------------------------

The 5th row is gone because it was deduped (I have no idea yet how I would work out which one gets removed, possibly by some date column). So I can say that 1 record has been removed.

After checking 3 columns:

---------------------------------------------------
| ColumnA | ColumnB | ColumnC | ColumnD | ColumnN |
---------------------------------------------------
| Peter   | Dink    | Midget  | NULL    | 0738455 |
| Peter   | Dink    | Child   | 334AA   | 49595   |
| Mark    | Walhg   | Funky   | 334AA   | 0738455 | 
| Mark    | Dink    | NULL    | NULL    | NULL    |
---------------------------------------------------

So I can tell that 1 more record has been removed.

Then 2 columns:

---------------------------------------------------
| ColumnA | ColumnB | ColumnC | ColumnD | ColumnN |
---------------------------------------------------
| Peter   | Dink    | Midget  | NULL    | 0738455 |
| Mark    | Walhg   | Funky   | 334AA   | 0738455 | 
| Mark    | Dink    | NULL    | NULL    | NULL    |
---------------------------------------------------

Another row has been removed.
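
The cascade walked through above could be sketched as a loop that lowers the threshold one step at a time. Everything here is an assumption made for illustration: the table variable `@Src`, its explicit `ID` column, the log table `@Log`, and above all the rule that the row with the lower `ID` survives (the question leaves open how the survivor should be chosen). Blank handling is left out to keep it short.

```sql
-- Hypothetical copy of the sample data; the ID column is an assumption.
DECLARE @Src TABLE (
    ID      INT PRIMARY KEY,
    ColumnA NVARCHAR(100),
    ColumnB NVARCHAR(100),
    ColumnC NVARCHAR(100),
    ColumnD NVARCHAR(100),
    ColumnN NVARCHAR(100)
);

INSERT INTO @Src (ID, ColumnA, ColumnB, ColumnC, ColumnD, ColumnN) VALUES
(1, 'Peter', 'Dink',  'Midget', NULL,    '0738455'),
(2, 'Peter', 'Dink',  'Child',  '334AA', '49595'),
(3, 'Mark',  'Walhg', 'Funky',  '334AA', '0738455'),
(4, 'Mark',  'Dink',  NULL,     NULL,    NULL),
(5, 'Mark',  'Walhg', 'Funky',  '334AA', NULL),
(6, 'Peter', 'Dink',  NULL,     NULL,    '0738455');

-- Hypothetical log of how many rows each pass removes.
DECLARE @Log TABLE (MatchCols INT, RowsRemoved INT);

DECLARE @N INT;
DECLARE @Removed INT;
SET @N = 5;

WHILE @N >= 2
BEGIN
    -- Remove a row when a row with a lower ID, present at the start of
    -- this pass, matches it on at least @N columns (NULLs never match).
    DELETE b
    FROM @Src AS b
    WHERE EXISTS (
        SELECT 1
        FROM @Src AS a
        WHERE a.ID < b.ID
          AND (  CASE WHEN a.ColumnA = b.ColumnA THEN 1 ELSE 0 END
               + CASE WHEN a.ColumnB = b.ColumnB THEN 1 ELSE 0 END
               + CASE WHEN a.ColumnC = b.ColumnC THEN 1 ELSE 0 END
               + CASE WHEN a.ColumnD = b.ColumnD THEN 1 ELSE 0 END
               + CASE WHEN a.ColumnN = b.ColumnN THEN 1 ELSE 0 END) >= @N
    );

    -- Capture the DELETE's row count before any other statement runs.
    SET @Removed = @@ROWCOUNT;
    INSERT INTO @Log (MatchCols, RowsRemoved) VALUES (@N, @Removed);

    SET @N = @N - 1;
END;

SELECT MatchCols, RowsRemoved FROM @Log ORDER BY MatchCols DESC;
SELECT ID, ColumnA, ColumnB, ColumnC, ColumnD, ColumnN FROM @Src ORDER BY ID;
```

With the sample rows the log shows 0 rows removed at 5 columns and 1 row removed at each of 4, 3 and 2 columns, leaving rows 1, 3 and 4, which is the same end state as in the tables above.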

The way I thought of approaching it was to give each pair of records a weight, essentially the number of matching data points across a selection of columns. For example, I might not want the Country column to count as one of the matching columns; I would only use columns that identify the record, such as Name and Phone Number.
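
A rough sketch of that weighting idea, assuming per-column switches decide which columns take part. The switch variables `@UseA` to `@UseN`, the table variable `@Src` and its `ID` column are hypothetical names, and which columns count as "identifying" is chosen arbitrarily here (A, B and N).

```sql
-- Hypothetical copy of the sample data; the ID column is an assumption.
DECLARE @Src TABLE (
    ID      INT PRIMARY KEY,
    ColumnA NVARCHAR(100),
    ColumnB NVARCHAR(100),
    ColumnC NVARCHAR(100),
    ColumnD NVARCHAR(100),
    ColumnN NVARCHAR(100)
);

INSERT INTO @Src (ID, ColumnA, ColumnB, ColumnC, ColumnD, ColumnN) VALUES
(1, 'Peter', 'Dink',  'Midget', NULL,    '0738455'),
(2, 'Peter', 'Dink',  'Child',  '334AA', '49595'),
(3, 'Mark',  'Walhg', 'Funky',  '334AA', '0738455'),
(4, 'Mark',  'Dink',  NULL,     NULL,    NULL),
(5, 'Mark',  'Walhg', 'Funky',  '334AA', NULL),
(6, 'Peter', 'Dink',  NULL,     NULL,    '0738455');

-- Hypothetical switches: 1 = column counts toward the weight, 0 = ignored.
DECLARE @UseA INT, @UseB INT, @UseC INT, @UseD INT, @UseN INT;
SELECT @UseA = 1, @UseB = 1, @UseC = 0, @UseD = 0, @UseN = 1;

SELECT a.ID AS ID1, b.ID AS ID2, w.Weight
FROM @Src AS a
INNER JOIN @Src AS b ON a.ID < b.ID
CROSS APPLY (
    -- Weight = number of selected columns on which the two rows agree.
    SELECT @UseA * CASE WHEN a.ColumnA = b.ColumnA THEN 1 ELSE 0 END
         + @UseB * CASE WHEN a.ColumnB = b.ColumnB THEN 1 ELSE 0 END
         + @UseC * CASE WHEN a.ColumnC = b.ColumnC THEN 1 ELSE 0 END
         + @UseD * CASE WHEN a.ColumnD = b.ColumnD THEN 1 ELSE 0 END
         + @UseN * CASE WHEN a.ColumnN = b.ColumnN THEN 1 ELSE 0 END AS Weight
) AS w
WHERE w.Weight >= 2
ORDER BY w.Weight DESC, a.ID, b.ID;
```

With only A, B and N switched on, the sample rows give pair (1,6) a weight of 3 and pairs (1,2), (2,6) and (3,5) a weight of 2. Replacing the 0/1 switches with larger integers would let some columns count more than others.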

Then I can see how many records come out for each weight (number of matching columns) and decide, for example, that we will dedupe everything with 7 matching columns of identifying data, and collapse the duplicates into one record, filling in any values that are NULL / blank in one record from the other.
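
The collapsing step might look roughly like this, assuming an earlier step has already decided which row survives. The pair table `@Pairs` with its `KeepID` / `DupID` columns is hypothetical, as are `@Src` and its `ID` column, and the single pair in it deliberately keeps row 5 rather than row 3 so that the fill-in is visible. It also assumes at most one duplicate per surviving row.

```sql
-- Hypothetical copy of the sample data; the ID column is an assumption.
DECLARE @Src TABLE (
    ID      INT PRIMARY KEY,
    ColumnA NVARCHAR(100),
    ColumnB NVARCHAR(100),
    ColumnC NVARCHAR(100),
    ColumnD NVARCHAR(100),
    ColumnN NVARCHAR(100)
);

INSERT INTO @Src (ID, ColumnA, ColumnB, ColumnC, ColumnD, ColumnN) VALUES
(1, 'Peter', 'Dink',  'Midget', NULL,    '0738455'),
(2, 'Peter', 'Dink',  'Child',  '334AA', '49595'),
(3, 'Mark',  'Walhg', 'Funky',  '334AA', '0738455'),
(4, 'Mark',  'Dink',  NULL,     NULL,    NULL),
(5, 'Mark',  'Walhg', 'Funky',  '334AA', NULL),
(6, 'Peter', 'Dink',  NULL,     NULL,    '0738455');

-- Hypothetical result of the matching step: survivor and duplicate.
DECLARE @Pairs TABLE (KeepID INT, DupID INT);
INSERT INTO @Pairs (KeepID, DupID) VALUES (5, 3);

-- Fill each NULL / empty value of the survivor from its duplicate.
UPDATE k
SET ColumnA = COALESCE(NULLIF(k.ColumnA, ''), d.ColumnA),
    ColumnB = COALESCE(NULLIF(k.ColumnB, ''), d.ColumnB),
    ColumnC = COALESCE(NULLIF(k.ColumnC, ''), d.ColumnC),
    ColumnD = COALESCE(NULLIF(k.ColumnD, ''), d.ColumnD),
    ColumnN = COALESCE(NULLIF(k.ColumnN, ''), d.ColumnN)
FROM @Src AS k
INNER JOIN @Pairs AS p ON p.KeepID = k.ID
INNER JOIN @Src   AS d ON d.ID = p.DupID;

-- Then drop the duplicate row.
DELETE d
FROM @Src AS d
INNER JOIN @Pairs AS p ON p.DupID = d.ID;

SELECT ID, ColumnA, ColumnB, ColumnC, ColumnD, ColumnN FROM @Src ORDER BY ID;
```

After this runs, row 3 is gone and row 5 carries `0738455` in ColumnN, taken from row 3; the other four rows are untouched.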

This is all well beyond me. I know what I want to do; I just don't know how to do it.

+3

sql tsql

NeomerArcana 22 oct. '14 at 6:24

1 answer

SubqueryCrunch · Answer 1 · 2014-10-22T07:15:22+0000

I hope I understood you correctly. This is my idea of how it can be done, though it is not complete: you can automate it with dynamic SQL and a WHILE loop that goes through all IDs, and combine the results afterwards.

-- Recreate the test table from scratch.
IF OBJECT_ID('TestTable1') IS NOT NULL 
    DROP TABLE TestTable1

CREATE TABLE TestTable1 (
    ID INT IDENTITY(1,1),
    ColumnA NVARCHAR(100),
    ColumnB NVARCHAR(100),
    ColumnC NVARCHAR(100),
    ColumnD NVARCHAR(100),
    ColumnE NVARCHAR(100)  -- text rather than INT, so the leading zero in '0738455' is kept
)

-- ID is an identity column, so only the five data columns are supplied.
INSERT INTO TestTable1 VALUES 
('Peter','Dink','Milk',NULL,'0738455'),
('Peter','Dink','Beer','334AA','49595'),
('Mark','Walk','Funky','334AA','0738455'),
('Mark','Dink',NULL,NULL,NULL),
('Mark','Walk','Funky','334AA',NULL),
('Peter','Dink',NULL,NULL,'0738455')

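-- The record that every other record is compared against.
DECLARE @ID INT
SET @ID = 1

-- Record @ID plus every record that matches it on ColumnA.
-- "=" is never true when either side is NULL, so NULLs do not count as matches.
SELECT * FROM TestTable1 WHERE ID IN 
(
    SELECT ID FROM
    (   
        SELECT @ID AS ID
        UNION
        SELECT b.ID FROM TestTable1 AS a
        INNER JOIN TestTable1 AS b
            ON a.ColumnA = b.ColumnA
        WHERE a.ID = @ID AND b.ID <> @ID
    ) AS OneMatchingColumn
) 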


-- Record @ID plus every record that matches it on ColumnA and ColumnB.
SELECT * FROM TestTable1 WHERE ID IN 
(
    SELECT ID FROM
    (
        SELECT @ID AS ID
        UNION
        SELECT b.ID FROM TestTable1 AS a
        INNER JOIN TestTable1 AS b
            ON a.ColumnA = b.ColumnA
            AND a.ColumnB = b.ColumnB
        WHERE a.ID = @ID AND b.ID <> @ID
    ) AS TwoMatchingColumns
)


-- Record @ID plus every record that matches it on ColumnA, ColumnB and ColumnC.
SELECT * FROM TestTable1 WHERE ID IN 
(
    SELECT ID FROM
    (
        SELECT @ID AS ID
        UNION
        SELECT b.ID FROM TestTable1 AS a
        INNER JOIN TestTable1 AS b
            ON a.ColumnA = b.ColumnA
            AND a.ColumnB = b.ColumnB
            AND a.ColumnC = b.ColumnC
        WHERE a.ID = @ID AND b.ID <> @ID
    ) AS ThreeMatchingColumns
)
