Performing a large update using multiple tables in PostgreSQL

I started working with Postgres a few weeks ago, and I've been trying to solve this problem for the last few days with mixed results: I have a table (10 million rows) that needs to be updated once a month with information stored in several other tables. During this update nobody else has access to the database, so there are no concurrent reads or writes; I am the only user.

Table A contains 10 million rows; one column (the rebuilding cost) needs to be updated. Tables B, C, and D each contain a freshly calculated rebuilding cost for every row in Table A (so each of B, C, and D also has 10 million rows). Which table supplies the value for a given row depends on the cost range, see below. I am trying to perform the update with this query:

UPDATE table_A a
SET rebuilding_costs = 
    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs 
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs 
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs 
    ELSE NULL
    END
FROM table_B b
LEFT OUTER JOIN table_C c
    ON (b.address = c.address)
LEFT OUTER JOIN table_D d
    ON (b.address = d.address)
WHERE a.address = b.address
;

This query gives the correct result, but it is a bit slow (25 minutes). The funny thing is: when table A has 1 million rows instead of 10 million, it takes only 30 seconds, so with 10 million rows I expected the query to finish within about 5 minutes, not 25. That's when I tried to UPDATE in blocks, by adding this condition to the WHERE clause:

AND (a.id > 0 AND a.id < 1000000)

a.id is the primary key of table A. This example updates only the first million rows of table A, yet it took 3 minutes to complete; doing this 10 times to cover all 10 million rows would take 30 minutes.
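For completeness, the block-wise approach can be scripted so it does not have to be typed ten times. A minimal sketch in a PL/pgSQL DO block, assuming a.id is a reasonably dense integer key and reusing the original CASE query:

```sql
-- Sketch: run the UPDATE in batches of 1,000,000 rows by primary-key range.
-- Assumes table_A.id is a reasonably dense integer key.
DO $$
DECLARE
    lo  bigint := 0;
    hi  bigint;
BEGIN
    SELECT max(id) INTO hi FROM table_A;
    WHILE lo <= hi LOOP
        UPDATE table_A a
        SET rebuilding_costs =
            CASE
            WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
            WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
            WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
            END
        FROM table_B b
        LEFT OUTER JOIN table_C c ON (b.address = c.address)
        LEFT OUTER JOIN table_D d ON (b.address = d.address)
        WHERE a.address = b.address
          AND a.id >= lo AND a.id < lo + 1000000;
        lo := lo + 1000000;
    END LOOP;
END
$$;
```

Note that a DO block runs as a single transaction, so this does not let autovacuum reclaim dead tuples between batches; issuing each batch as a separate statement (e.g. from a shell script) would give that benefit.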

I also tried to keep the query from touching rows that the UPDATE would not change, by adding the following:

AND a.herbouwwaarde_indicatie IS DISTINCT FROM b.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM c.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM d.inhoud

Setting the fillfactor to 70 and then to 50 improved the speed a bit, but I couldn't get the runtime below 20 minutes.

I also tried recreating table A instead, for example:

CREATE TABLE table_A_new
AS
SELECT a.address,
    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs 
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs 
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs 
    ELSE NULL
    END AS rebuildingcosts
FROM table_A a
LEFT OUTER JOIN table_B b
    ON (a.address = b.address)
LEFT OUTER JOIN table_C c
    ON (a.address = c.address)
LEFT OUTER JOIN table_D d
    ON (a.address = d.address)
;

This is very fast (2 minutes) and gives the correct result, but it recreates the entire table, which doesn't seem very efficient (and temporarily takes up a lot of extra storage). That is why I started with UPDATE queries.
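If the rebuild route wins, the usual pattern is to create the new table, recreate its indexes and constraints, and then swap the names. A sketch, assuming table_A consists only of id, address, and rebuilding_costs (any further columns must be carried over in the SELECT):

```sql
BEGIN;

CREATE TABLE table_A_new AS
SELECT a.id, a.address,
    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
    END AS rebuilding_costs
FROM table_A a
LEFT OUTER JOIN table_B b ON (a.address = b.address)
LEFT OUTER JOIN table_C c ON (a.address = c.address)
LEFT OUTER JOIN table_D d ON (a.address = d.address);

-- Recreate the primary key and the join index on the new table.
ALTER TABLE table_A_new ADD PRIMARY KEY (id);
CREATE INDEX ON table_A_new (address);

-- Swap the names; drop table_A_old once the result is verified.
ALTER TABLE table_A RENAME TO table_A_old;
ALTER TABLE table_A_new RENAME TO table_A;

COMMIT;
```

Building the table first and creating the indexes afterwards is generally faster than maintaining the indexes row by row during a huge UPDATE, and the new table carries no dead tuples.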

My question is: what is the best approach here? Is there a way to improve the performance of the UPDATE query, or is an alternative such as the CREATE TABLE ... AS approach above the better option?

I have to use Postgres; switching to another DBMS is not an option.

This is the execution plan for the UPDATE query (the tables currently contain about 6-8 million rows rather than 10 million):

Update on tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65)
  ->  Hash Join  (cost=902288.27..2150690.80 rows=6714762 width=65)
        Hash Cond: ((b.adres)::text = a.adres)
        Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
            AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
            AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
        ->  Hash Left Join  (cost=522527.27..1318059.42 rows=6716471 width=39)
              Hash Cond: ((b.adres)::text = (d.adres)::text)
              ->  Hash Right Join  (cost=295916.60..817658.93 rows=6716471 width=29)
                    Hash Cond: ((c.adres)::text = (b.adres)::text)
                    ->  Seq Scan on tabel_C c  (cost=0.00..240642.35 rows=7600735 width=19)
                    ->  Hash  (cost=172605.71..172605.71 rows=6716471 width=19)
                          ->  Seq Scan on tabel_B b  (cost=0.00..172605.71 rows=6716471 width=19)
              ->  Hash  (cost=103436.52..103436.52 rows=6709052 width=19)
                    ->  Seq Scan on tabel_D d  (cost=0.00..103436.52 rows=6709052 width=19)
        ->  Hash  (cost=217261.00..217261.00 rows=8000000 width=39)
              ->  Seq Scan on tabel_A a  (cost=0.00..217261.00 rows=8000000 width=39)

This is the output of EXPLAIN ANALYZE:

Update on tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=2815452.997..2815452.997 rows=0 loops=1)
  ->  Hash Join  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=108861.999..214888.780 rows=5252864 loops=1)
        Hash Cond: ((b.adres)::text = a.adres)
        Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
            AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
            AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
        Rows Removed by Join Filter: 670998
        ->  Hash Left Join  (cost=522527.27..1318059.42 rows=6716471 width=39) (actual time=43138.635..116933.803 rows=6711432 loops=1)
              Hash Cond: ((b.adres)::text = (d.adres)::text)
              ->  Hash Right Join  (cost=295916.60..817658.93 rows=6716471 width=29) (actual time=34571.750..99040.256 rows=6710550 loops=1)
                    Hash Cond: ((c.adres)::text = (b.adres)::text)
                    ->  Seq Scan on tabel_C c  (cost=0.00..240642.35 rows=7600735 width=19) (actual time=127.080..59703.935 rows=7595083 loops=1)
                    ->  Hash  (cost=172605.71..172605.71 rows=6716471 width=19) (actual time=29925.787..29925.787 rows=6709229 loops=1)
                          Buckets: 2048  Batches: 512  Memory Usage: 678kB
                          ->  Seq Scan on tabel_B b  (cost=0.00..172605.71 rows=6716471 width=19) (actual time=0.017..27245.069 rows=6709229 loops=1)
              ->  Hash  (cost=103436.52..103436.52 rows=6709052 width=19) (actual time=8566.848..8566.848 rows=6709229 loops=1)
                    Buckets: 2048  Batches: 512  Memory Usage: 678kB
                    ->  Seq Scan on tabel_D d  (cost=0.00..103436.52 rows=6709052 width=19) (actual time=0.009..5970.010 rows=6709229 loops=1)
        ->  Hash  (cost=217261.00..217261.00 rows=8000000 width=39) (actual time=65721.815..65721.815 rows=8000000 loops=1)
              Buckets: 2048  Batches: 1024  Memory Usage: 612kB
              ->  Seq Scan on tabel_A a  (cost=0.00..217261.00 rows=8000000 width=39) (actual time=0.056..55968.171 rows=8000000 loops=1)
Total runtime: 2815453.549 ms

Tables A, B, C, and D all have an index on the address column, the column used in the joins. Apart from the primary key (id) on table A, there are no other indexes. Tables B, C, and D contain 5 to 7 additional columns that are not used in this process.
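One thing the EXPLAIN ANALYZE output gives away: the hash nodes report `Buckets: 2048  Batches: 512  Memory Usage: 678kB`, i.e. each hash join is being split into hundreds of on-disk batches because work_mem is very small (this looks like a default-sized setting). Raising it for this one session is cheap to try; the value below is only an illustration and should be sized to the machine's available RAM:

```sql
-- Session-local setting; does not affect other connections or the server default.
SET work_mem = '512MB';
-- Re-run the UPDATE (or its EXPLAIN ANALYZE) in the same session to compare
-- the number of hash batches and the total runtime.
```

With enough work_mem each hash join can be built in a single in-memory batch, which removes most of the temporary-file I/O visible in the plan.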


1 answer


This query should give the same results, but avoids the CASE expression:

-- EXPLAIN ANALYZE
UPDATE table_a a
SET rebuilding_costs = drie.rebuilding_costs
FROM (
        SELECT COALESCE(b.address, c.address, d.address) AS address
        , COALESCE(b.rebuilding_costs, c.rebuilding_costs,  d.rebuilding_costs)
            AS rebuilding_costs
        FROM table_b b
        FULL OUTER JOIN table_c c ON c.address = b.address
                AND  c.rebuilding_costs BETWEEN 200001 AND 400000
        FULL OUTER JOIN table_D d ON d.address = b.address
                AND  d.rebuilding_costs BETWEEN 400001 AND 600000
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        ) drie
WHERE a.address = drie.address
AND a.rebuilding_costs <> drie.rebuilding_costs -- Avoid useless updates
        ;



UPDATE: a similar approach, based on chained CTEs:



-- --------------------------------
EXPLAIN ANALYZE
WITH cte_b AS (
        SELECT b.address,  b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        )
,       cte_c AS (
        SELECT c.address , c.rebuilding_costs
        FROM table_c c
        WHERE  c.rebuilding_costs BETWEEN 200001 AND 400000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
        )
,       cte_d AS (
        SELECT d.address , d.rebuilding_costs
        FROM table_d d
        WHERE  d.rebuilding_costs BETWEEN 400001 AND 600000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
        AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
        )
,       cte_bcd AS (
        SELECT           cte_b.address,  cte_b.rebuilding_costs FROM cte_b
        UNION ALL SELECT cte_c.address,  cte_c.rebuilding_costs FROM cte_c
        UNION ALL SELECT cte_d.address,  cte_d.rebuilding_costs FROM cte_d
        )
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
-- avoid useless updates this way:
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs
-- ,or this way:
-- AND cte_bcd.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs
   ;



UPDATE2: CTEs can be slow because they act as optimization barriers. The quick and dirty workaround is to rewrite them as (temporary) VIEWs and reference those instead. This allows the optimizer to shuffle different parts of the query into and out of subqueries, and to combine and reuse them.

CREATE TEMP VIEW cte_b AS (
        SELECT b.address,  b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        );
CREATE TEMP VIEW        cte_c AS (
        SELECT c.address , c.rebuilding_costs
        FROM table_c c
        WHERE  c.rebuilding_costs BETWEEN 200001 AND 400000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
        );
CREATE TEMP VIEW        cte_d AS (
        SELECT d.address , d.rebuilding_costs
        FROM table_d d
        WHERE  d.rebuilding_costs BETWEEN 400001 AND 600000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
        AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
        );
CREATE TEMP VIEW        cte_bcd AS (
        SELECT           cte_b.address,  cte_b.rebuilding_costs FROM cte_b
        UNION ALL SELECT cte_c.address,  cte_c.rebuilding_costs FROM cte_c
        UNION ALL SELECT cte_d.address,  cte_d.rebuilding_costs FROM cte_d
        );
EXPLAIN -- ANALYZE
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs -- avoid useless updates
-- AND a.address < 100000
        ;
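As an aside (beyond the original answer): since PostgreSQL 12, CTEs are no longer unconditional optimization barriers; the planner inlines them where possible, and materialization can be controlled per CTE. On such a version the temporary views are not needed, for example:

```sql
-- PostgreSQL 12+: force inlining of a CTE instead of materializing it.
WITH cte_b AS NOT MATERIALIZED (
        SELECT b.address, b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        )
SELECT * FROM cte_b;
```

The `Total runtime` line in the plan above suggests an older server version, where the view-based rewrite remains the way to get this behavior.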

