Populating random data from another table
update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I need to update column4 of dataset1 so that each row gets a random record from column1 of dataset2. But with the query above, I get just one random record repeated across all rows of dataset1, whereas I want it to be random for each row.
SETUP
Let's start by taking your data tables as follows. Note that I am assuming dataset1 has a primary key (it can be composite, but for simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data:
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;
Functional check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll do a full reshuffle. The values from dataset2 will each be used once and no more than once.
EXPLANATION
First, why your query doesn't work: the subquery is not correlated with the outer UPDATE, so PostgreSQL evaluates it only once and applies that single result to every row. To make an update that moves all the values from dataset2 to column4 at random, we need some intermediate steps.
For dataset1 we need to create a list (relation) of tuples (id, rn), which is simple:
(id_1, 1), (id_2, 2), (id_3, 3), ... (id_20, 20)
where id_1, ..., id_20 are the identifiers present in dataset1. They can be of any type, they do not have to be sequential, and they can be composite.
For dataset2 we need to create another list of tuples (column1, rn) that looks like this:
(column1_1, 17), (column1_2, 3), (column1_3, 11), ... (column1_20, 15)
In this case, the second column contains all the values 1 ... 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. In practice, this produces yet another list of tuples (id, column1), where the pairing has been made at random. We use these pairs to update dataset1.
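If you want to inspect that random pairing before committing to an UPDATE, a query along these lines should show it (just an inspection sketch over the tables defined above; the output changes on every run):
SELECT
    original_keys.id,
    shuffled_data.column1
FROM
    (SELECT
         id,
         row_number() OVER () AS rn
     FROM
         dataset1
    ) AS original_keys
JOIN
    (SELECT
         column1,
         row_number() OVER (ORDER BY random()) AS rn
     FROM
         dataset2
    ) AS shuffled_data
    ON original_keys.rn = shuffled_data.rn ;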
THE ACTUAL QUERY
All of this can be done (I hope) using a few CTEs (WITH statements) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to the number of rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is done with:
row_number() OVER (ORDER BY random()) AS rn
row_number() is a window function that produces as many consecutive numbers as there are rows, starting at 1. These numbers end up randomly shuffled because the OVER clause sorts all the data randomly before numbering it.
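You can see the shuffle in isolation with a query like this (a sketch over the same dataset2 as above; the rn assignments will differ on every execution):
SELECT
    column1,
    row_number() OVER (ORDER BY random()) AS rn
FROM
    dataset2 ;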
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
SELECT * FROM dataset1 ;
 id | column4
 -: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that the same thing can be accomplished with plain subqueries, by simple substitution, instead of CTEs. This can improve performance in some cases:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
 id | column4
 -: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check out the whole setup and experiment on the dbfiddle here
NOTE. If you're doing this with very large datasets, don't expect it to be very fast. Shuffling a very large deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, the values for column4 can be repeated several times.
The simplest possibility I can think of (maybe not efficient, but understandable) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important! Otherwise the planner may call the function only once and reuse the result for every row
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
Thus, some values from dataset2 may not be used at all, while others will be used more than once.
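As a quick sanity check, mirroring the earlier ones (the exact count will vary between runs, but it will almost certainly be below 20, confirming that some values repeat while others go unused):
SELECT count(DISTINCT column4) FROM dataset1 ;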
dbfiddle here