Redshift: update or insert every row in a column with random data from another table

Question

update testdata.dataset1
   set abcd = (select abc 
               from dataset2
               order by random()
               limit 1
              )

In this case, a single random record from dataset2 is written to every row of dataset1.

I need each row of dataset1 to receive its own random record from dataset2.

Note: dataset1 may contain more rows than dataset2.

postgresql amazon-redshift

Karthic 06 Aug 17 at 18:02

1 answer

Stanislav Kralin · Accepted Answer · 2017-08-06T19:32:40+0000

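Query 1

You must reference abcd inside the subquery so that it becomes correlated with the row being updated. Otherwise the subquery is evaluated only once and its single result is reused for every row. Here abcd resolves to dataset1.abcd, assuming dataset2 has no column of that name. Note that the predicate abcd = abcd is not true for rows where abcd is NULL, so those rows are set to NULL.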

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                WHERE abcd = abcd
                ORDER BY random()
                LIMIT 1
               );
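
A quick way to confirm that the subquery was really evaluated per row is to count the distinct values written. This is only a sketch of a sanity check, not part of the original answer:

```sql
-- If the subquery ran only once, distinct_values is 1.
-- With per-row evaluation it is normally greater than 1
-- (provided dataset2.abc holds more than one distinct value).
SELECT COUNT(DISTINCT abcd) AS distinct_values,
       COUNT(*)             AS total_rows
FROM dataset1;
```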

Query 2

The following query should be faster on regular PostgreSQL, because it skips to a random offset instead of sorting all of dataset2 for every row of dataset1.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                WHERE abcd = abcd
                OFFSET floor(random()*(SELECT COUNT(*) FROM dataset2))
                LIMIT 1
               );

However, as you already reported, this is not the case for Redshift, which is a column store.

Query 3

Retrieving all records from dataset2 in one query will be more efficient than fetching records one by one. Let's test:

UPDATE dataset1 original
SET abcd = fake.abc
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY random()) AS id, abc
    FROM dataset2
) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;

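Please note that an integer column id must exist in dataset1.
Also, for dataset1.id values greater than the number of rows in dataset2, the abcd values are predictable: any two rows of dataset1 whose ids are congruent modulo that row count receive the same value. With 3 rows in dataset2, for example, ids 1, 4 and 7 all get the same abc.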

Query 4

Let's create an integer column fake_id in dataset1, pre-fill it with random values, and join on dataset1.fake_id = dataset2.id:

UPDATE dataset1
SET fake_id = floor(random()*(SELECT COUNT(*) FROM dataset2)) + 1;  

UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;
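
The two statements above assume that fake_id already exists and that dataset2.id runs from 1 to the row count without gaps. Neither prerequisite is shown in the answer, so the following is only a sketch, assuming standard ALTER TABLE syntax and an integer id column in dataset2:

```sql
-- Run once, before the two UPDATE statements.
ALTER TABLE dataset1 ADD COLUMN fake_id INTEGER;

-- Sanity check on dataset2.id. The join can match every generated fake_id
-- only if min_id = 1 and max_id = row_count = distinct_ids.
SELECT MIN(id)            AS min_id,
       MAX(id)            AS max_id,
       COUNT(*)           AS row_count,
       COUNT(DISTINCT id) AS distinct_ids
FROM dataset2;
```

If dataset2.id has gaps, rows of dataset1 whose fake_id matches no id are not touched by the second UPDATE and keep their previous abcd.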

Query 5

If you don't want to add a fake_id column to dataset1, calculate fake_id on the fly:

UPDATE dataset1
SET abcd = abc
FROM (
    SELECT with_fake_id.id, dataset2.abc
    FROM (
        SELECT dataset1.id,
               floor(random() * (SELECT COUNT(*) FROM dataset2) + 1) AS fake_id
        FROM dataset1
    ) AS with_fake_id
    JOIN dataset2 ON with_fake_id.fake_id = dataset2.id
) AS joined
WHERE dataset1.id = joined.id;

Performance

On regular PostgreSQL, query 4 seems to be the most efficient.
I'll try to compare the performance on a DC1.Large test instance.
