Populating random data from another table
update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I need to update column4 of dataset1 so that each row gets a random record from column1 of dataset2. But with the query above, I get just one random record repeated across all rows of dataset1, whereas I want it to be random for each row.
SETUP
Let's start by taking your data tables as follows. Note that I am assuming dataset1 has a primary key (it can be composite, but for simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data:
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;
Functional check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll do a full reshuffle. The values from dataset2 will each be used once and no more than once.
EXPLANATION
First, why your query doesn't work: the subquery is not correlated with the outer UPDATE, so PostgreSQL evaluates it only once and applies that single result to every row. To make an update that moves all the values from dataset2 to column4 at random, we need some intermediate steps.
For dataset1 we need to create a list (relation) of tuples (id, rn), which is simple:
(id_1, 1), (id_2, 2), (id_3, 3), ... (id_20, 20)
where id_1, ..., id_20 are the identifiers present in dataset1. They can be of any type, they do not have to be sequential, and they can be composite.
For dataset2 we need to create another list of tuples (column1, rn) that looks like this:
(column1_1, 17), (column1_2, 3), (column1_3, 11), ... (column1_20, 15)
In this case, the second column contains all the values 1 ... 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. In practice, this produces yet another list of tuples (id, column1), where the pairing has been made at random. We use these pairs to update dataset1.
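If you want to inspect that random pairing before committing to an UPDATE, a query along these lines should show it (just an inspection sketch over the tables defined above; the output changes on every run):
SELECT
    original_keys.id,
    shuffled_data.column1
FROM
    (SELECT
         id,
         row_number() OVER () AS rn
     FROM
         dataset1
    ) AS original_keys
JOIN
    (SELECT
         column1,
         row_number() OVER (ORDER BY random()) AS rn
     FROM
         dataset2
    ) AS shuffled_data
    ON original_keys.rn = shuffled_data.rn ;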
THE ACTUAL QUERY
All of this can be done (I hope) using a few CTEs (WITH statements) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to the number of rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is done with:
row_number() OVER (ORDER BY random()) AS rn
row_number() is a window function that produces as many consecutive numbers as there are rows, starting at 1. These numbers end up randomly shuffled because the OVER clause sorts all the data randomly before numbering it.
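You can see the shuffle in isolation with a query like this (a sketch over the same dataset2 as above; the rn assignments will differ on every execution):
SELECT
    column1,
    row_number() OVER (ORDER BY random()) AS rn
FROM
    dataset2 ;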
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
SELECT * FROM dataset1 ;
 id | column4
 -: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that the same thing can be accomplished with plain subqueries, by simple substitution, instead of CTEs. This can improve performance in some cases:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
 id | column4
 -: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check out the whole setup and experiment on the dbfiddle here
NOTE. If you're doing this with very large datasets, don't expect it to be very fast. Shuffling a very large deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, the values for column4 can be repeated several times.
The simplest possibility I can think of (maybe not efficient, but understandable) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important! Otherwise the planner may call the function only once and reuse the result for every row
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
Thus, some values from dataset2 may not be used at all, while others will be used more than once.
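As a quick sanity check, mirroring the earlier ones (the exact count will vary between runs, but it will almost certainly be below 20, confirming that some values repeat while others go unused):
SELECT count(DISTINCT column4) FROM dataset1 ;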
dbfiddle here