Redshift: update or insert every row in a column with random data from another table
update testdata.dataset1
set abcd = (select abc
from dataset2
order by random()
limit 1
)
In this case, a single random record from dataset2 is written to every row of dataset1.
I need each row of dataset1 to receive its own random record from dataset2.
Note: dataset1 may contain more rows than dataset2.
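Query 1
You must reference abcd inside the subquery to prevent "optimization": without a reference to the outer row, the subquery is evaluated only once and its result is reused for every row.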
UPDATE dataset1
SET abcd = (SELECT abc
FROM dataset2
WHERE abcd = abcd
ORDER BY random()
LIMIT 1
);
Query 2
The following query should be faster on regular PostgreSQL, since it avoids sorting dataset2 for every row.
UPDATE dataset1
SET abcd = (SELECT abc
FROM dataset2
WHERE abcd = abcd
OFFSET floor(random()*(SELECT COUNT(*) FROM dataset2))
LIMIT 1
);
However, as you already reported, this is not the case for Redshift, which is a column store.
Query 3
Retrieving all records from dataset2 in one query should be more efficient than fetching them one by one. Let's test:
UPDATE dataset1 original
SET abcd = fake.abc FROM
(SELECT ROW_NUMBER() OVER(ORDER BY random()) AS id, abc FROM dataset2) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;
Please note that an integer column id must exist in dataset1.
Also, when dataset1 has more rows than dataset2, the values of abcd become predictable: every dataset1.id with the same remainder modulo the dataset2 row count receives the same record.
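To make the predictability concrete, here is a minimal Python sketch of the mapping that Query 3 performs. It is only an illustration, not Redshift code: the function name assign_by_modulo, the sample values and the seed are all made up for this example, and the shuffle stands in for ROW_NUMBER() OVER (ORDER BY random()).

```python
import random

def assign_by_modulo(dataset1_ids, dataset2_values, seed=0):
    """Mimic Query 3: shuffle dataset2 once, then pick a slot by id % count."""
    rng = random.Random(seed)           # fixed seed only to make the demo repeatable
    shuffled = list(dataset2_values)
    rng.shuffle(shuffled)               # stands in for ROW_NUMBER() OVER (ORDER BY random())
    count = len(shuffled)
    # Query 3 joins on original.id % count = fake.id - 1,
    # so the zero-based slot index is simply id % count.
    return {row_id: shuffled[row_id % count] for row_id in dataset1_ids}

# Hypothetical data: 7 rows in dataset1, 3 rows in dataset2.
assignment = assign_by_modulo(range(1, 8), ["a", "b", "c"])
for row_id, value in assignment.items():
    print(row_id, value)
```

Rows 1, 4 and 7 always end up with the same value, as do rows 2 and 5, and rows 3 and 6. Once you know one of them, you know the others.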
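Query 4
Let's create an integer column fake_id in dataset1, pre-fill it with random values between 1 and the number of rows in dataset2, and join on dataset1.fake_id = dataset2.id (this assumes dataset2.id runs from 1 to that row count without gaps):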
UPDATE dataset1
SET fake_id = floor(random()*(SELECT COUNT(*) FROM dataset2)) + 1;
UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;
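If you want to try the two-step idea without a cluster, the sketch below runs it against an in-memory SQLite database from Python. This is a stand-in and not Redshift: the sample rows are invented, SQLite's random() returns a signed 64-bit integer rather than a float (so the fake_id expression uses a modulo instead of floor), and the second step uses a correlated subquery because UPDATE ... FROM is only available in newer SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset2 (id INTEGER PRIMARY KEY, abc TEXT);
    INSERT INTO dataset2 (id, abc) VALUES (1, 'x'), (2, 'y'), (3, 'z');
    CREATE TABLE dataset1 (id INTEGER PRIMARY KEY, abcd TEXT, fake_id INTEGER);
    INSERT INTO dataset1 (id) VALUES (1), (2), (3), (4), (5), (6), (7), (8);
""")

# Step 1: give every dataset1 row a random fake_id in 1..COUNT(dataset2).
# random() % n lies in -(n-1)..(n-1); abs() folds it into 0..(n-1).
conn.execute("""
    UPDATE dataset1
    SET fake_id = abs(random() % (SELECT COUNT(*) FROM dataset2)) + 1
""")

# Step 2: copy abc from the dataset2 row whose id matches fake_id.
conn.execute("""
    UPDATE dataset1
    SET abcd = (SELECT abc FROM dataset2 WHERE dataset2.id = dataset1.fake_id)
""")

rows = conn.execute(
    "SELECT id, fake_id, abcd FROM dataset1 ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Every row of dataset1 gets its own independently drawn fake_id, so repeated values are possible but not tied to dataset1.id the way they are in Query 3.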
Query 5
If you don't want to add a fake_id column to dataset1, calculate fake_id on the fly:
UPDATE dataset1
SET abcd = abc
FROM (
SELECT with_fake_id.id, dataset2.abc FROM
(SELECT dataset1.id, floor(RANDOM()*(SELECT COUNT(*) FROM dataset2) + 1) AS fake_id FROM dataset1) AS with_fake_id
JOIN dataset2 ON with_fake_id.fake_id = dataset2.id ) AS joined
WHERE dataset1.id = joined.id;
Performance
In regular PostgreSQL, Query 4 seems to be the most efficient.
I'll try to compare the performance on a dc1.large test instance.