Finding duplicate records in a large table across multiple columns

I have read a lot of threads on this subject and tried a few things, but nothing worked the way I hoped. I need some clarification, and I apologize if this counts as a duplicate thread.

My client has a Postgres database in which one table contains just over 12 million records. They tasked me with finding the duplicate records, retrieving them for review, and, if everything looks OK, removing the duplicates.

My main concern is server performance. Will running DISTINCT queries over 12 million records consume a lot of resources?

Since my first task is to fetch the records for review (say, as CSV) rather than delete them outright, my first attempt in pgAdmin was the following:

SELECT *
FROM my_table
WHERE my_table_id NOT IN (
    SELECT DISTINCT ON (
            num_1,
            num_2,
            num_3,
            num_4,
            num_5,
            my_date
        )
        my_table_id
    FROM my_table
);

However, this query was taking a long time; I cancelled it after 20 minutes. To make things more complicated, my client is reluctant to let me clone a local copy of the table due to strict security policies. They prefer everything to be done in the live hosting environment.

The table definition is pretty straightforward and looks like this:

CREATE TABLE my_table
(
    my_table_id bigserial NOT NULL,
    num_1 bigserial NOT NULL,
    num_2 bigserial NOT NULL,
    num_3 bigserial NOT NULL,
    num_4 numeric,
    num_5 integer,
    my_date date,
    my_text character varying
);

The primary key "my_table_id" is intact and always unique. The column "my_text" is irrelevant to the query, as it is empty for all duplicates; only the numeric and date fields need to match. All columns except my_table_id and my_text must match across records for them to qualify as duplicates.

What is the best way to solve this problem? Is there an approach that won't consume all of the server's resources in the hosting environment? Please help me understand the best way forward!

Thank you!

2 answers


You should use GROUP BY and HAVING to find the duplicate records instead of DISTINCT ON. The subquery finds every duplicated combination of values, and the join then retrieves all the matching rows:



SELECT mt.*
FROM my_table mt
JOIN (
    SELECT num_1, num_2, num_3, num_4, num_5, my_date
    FROM my_table
    GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
    HAVING COUNT(*) > 1
) t
ON  mt.num_1 = t.num_1
AND mt.num_2 = t.num_2
AND mt.num_3 = t.num_3
-- num_4, num_5 and my_date are nullable; IS NOT DISTINCT FROM treats
-- NULLs as equal, matching the way GROUP BY groups them
AND mt.num_4   IS NOT DISTINCT FROM t.num_4
AND mt.num_5   IS NOT DISTINCT FROM t.num_5
AND mt.my_date IS NOT DISTINCT FROM t.my_date;
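
Once the duplicates have been reviewed and approved, a self-join DELETE can remove every row except one survivor per group. Below is a minimal sketch, assuming the row with the smallest my_table_id in each group should be kept; adjust the comparison if a different survivor is wanted:

DELETE FROM my_table a
USING my_table b
WHERE a.my_table_id > b.my_table_id   -- keep the smallest id per group (assumption)
  AND a.num_1 = b.num_1
  AND a.num_2 = b.num_2
  AND a.num_3 = b.num_3
  AND a.num_4   IS NOT DISTINCT FROM b.num_4
  AND a.num_5   IS NOT DISTINCT FROM b.num_5
  AND a.my_date IS NOT DISTINCT FROM b.my_date;

Wrapping this in a transaction (BEGIN; ... then ROLLBACK; or COMMIT;) lets you check the reported row count before anything becomes permanent.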



Another way is to use window (analytic) functions:



SELECT *
FROM (
    SELECT *,
           -- number of rows sharing the same values in all six matching columns
           COUNT(*) OVER (PARTITION BY num_1, num_2, num_3, num_4, num_5, my_date) AS cnt
    FROM my_table
) t1
WHERE cnt > 1;
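
Since the first deliverable is a CSV for review, the result can be exported without pulling millions of rows into pgAdmin. Here is a sketch using psql's client-side \copy (the output file name is illustrative; note that \copy must be written on a single line):

-- 'duplicates.csv' below is an illustrative client-side path
\copy (SELECT * FROM (SELECT *, COUNT(*) OVER (PARTITION BY num_1, num_2, num_3, num_4, num_5, my_date) AS cnt FROM my_table) t1 WHERE cnt > 1) TO 'duplicates.csv' WITH (FORMAT csv, HEADER)

Because \copy runs the underlying COPY on the server but writes through the client connection, it works even when you have no file access to the host.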
