How to update each row of a table with a random row from another table
I am creating my first de-identification script and am running into problems with my approach.
I have a table dbo.pseudonyms whose firstname column is filled with 200 rows of data. Every row in this column has a value (none are NULL). The table also has an id column (int, primary key, not null) numbered 1-200.
What I want to do is, in one statement, repopulate my entire USERS table with firstname data randomly selected for each row from the pseudonyms table.
To generate a random number for the selection, I use ABS(Checksum(NewId())) % 200. Every time I run SELECT ABS(Checksum(NewId())) % 200, I get a numeric value in the range I am looking for, with no intermittent behavior.
HOWEVER, when I use this formula in the following statement:
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(Checksum(NewId())) % 200
I am getting VERY intermittent results. I would say that about 30% of the runs return a single name selected from the table (this is the expected result), about 30% return more than one row (which is baffling, since there are no duplicate values in the id column), and about 30% return NULL (even though there are no empty rows in the firstname column).
I have been looking into this specific issue for a long time but haven't figured it out yet. My guess is that the problem is with using this formula as a pointer, but I am at a loss how to do it otherwise.
Thoughts?
Why the query in your question returns unexpected results
The original query selects from Pseudonyms. The server goes through each row of the table, takes the ID from that row, generates a random number, and compares the generated number with the ID.
If the number generated for a particular row happens to equal the ID of that row, the row is returned in the result set. It is quite possible that the generated number never equals any ID, and equally possible that it equals the ID on several different rows.
A little more detail:
- The server fetches the row with ID=1.
- It generates a random number, say 25. Why not? A decent random number.
- Is 1 = 25? No, so this row is not returned.
- The server fetches the row with ID=2.
- It generates a random number, say 125. Why not? A decent random number.
- Is 2 = 125? No, so this row is not returned.
- And so on for every row.
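The row-by-row evaluation described above can be seen directly by counting how many rows the predicate lets through on each run. This is a minimal sketch, not part of the original answer: the table variable @P and its contents are hypothetical stand-ins for dbo.pseudonyms, and it assumes that SQL Server evaluates NEWID() once per row in the WHERE clause, as the explanation above describes.

```sql
-- Hypothetical stand-in for dbo.pseudonyms: ids 1..200 with distinct names.
DECLARE @P TABLE (id int PRIMARY KEY, firstname varchar(50) NOT NULL);

INSERT INTO @P (id, firstname)
SELECT TOP (200)
    ROW_NUMBER() OVER (ORDER BY o.object_id),
    'Name' + CAST(ROW_NUMBER() OVER (ORDER BY o.object_id) AS varchar(10))
FROM sys.all_objects AS o
ORDER BY o.object_id;

-- The predicate draws a fresh random number for every row it tests,
-- so the count can be 0, 1, or more than 1. Run this several times.
SELECT COUNT(*) AS MatchedRows
FROM @P AS pn
WHERE pn.id = ABS(CHECKSUM(NEWID())) % 200;

-- Drawing the number once, before the query, gives exactly one row.
-- The "1 +" shifts the 0..199 result of % 200 into the 1..200 id range.
DECLARE @rnd int = 1 + ABS(CHECKSUM(NEWID())) % 200;

SELECT pn.firstname
FROM @P AS pn
WHERE pn.id = @rnd;
```

If each row matches independently with a probability of about 1/200, the number of matches is roughly Poisson-distributed with mean 1. That gives about 37% empty results, 37% single-row results, and 26% multi-row results, which is in line with the rough thirds reported in the question. Note also that % 200 yields 0 to 199, so without the shift the row with id 200 can never be selected.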
Here is a complete SQL Fiddle solution
Sample data
DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);
INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar(10)) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
The table Users has 1000 rows, all with the same UserName. The table Pseudonyms has 200 rows, each with a different PseudonymName:
SELECT * FROM @VarUsers;
ID UserName
-- --------
1 UserName
2 UserName
...
999 UserName
1000 UserName
SELECT * FROM @VarPseudonyms;
ID PseudonymName
-- -------------
1 PseudonymName1
2 PseudonymName2
...
199 PseudonymName199
200 PseudonymName200
First try
I tried the direct approach first. For each row in Users I want to get one random row from Pseudonyms:
SELECT
U.ID
,U.UserName
,CA.PseudonymName
FROM
@VarUsers AS U
CROSS APPLY
(
SELECT TOP(1)
P.PseudonymName
FROM @VarPseudonyms AS P
ORDER BY CRYPT_GEN_RANDOM(4)
) AS CA
;
It turns out the optimizer is too smart: this produced a random PseudonymName, but the same one for every User, which I didn't expect:
ID UserName PseudonymName
1 UserName PseudonymName181
2 UserName PseudonymName181
...
999 UserName PseudonymName181
1000 UserName PseudonymName181
So I modified the approach a bit and first generated a random number for each row in Users. I then used the generated number to find the Pseudonym with that ID for each row in Users, using CROSS APPLY.
CTE_Users has an extra column with a random number from 1 to 200. In CTE_Joined we select a row from Pseudonyms for each User. Finally, we UPDATE the original Users table.
Final solution
WITH
CTE_Users
AS
(
SELECT
U.ID
,U.UserName
,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd
FROM @VarUsers AS U
)
,CTE_Joined
AS
(
SELECT
CTE_Users.ID
,CTE_Users.UserName
,CA.PseudonymName
FROM
CTE_Users
CROSS APPLY
(
SELECT P.PseudonymName
FROM @VarPseudonyms AS P
WHERE P.ID = CAST(CTE_Users.rnd AS int)
) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;
Results
SELECT * FROM @VarUsers;
ID UserName
1 PseudonymName41
2 PseudonymName132
3 PseudonymName177
...
998 PseudonymName60
999 PseudonymName141
1000 PseudonymName157
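A note on the expression that produces rnd in CTE_Users: it maps a signed 32-bit integer onto the range 1 to 200 through decimal arithmetic. As an alternative sketch (not from the original answer), the same range can be reached with integer arithmetic only: clear the sign bit with a bitwise AND, take the remainder modulo 200, and add 1. The alias RandomID is hypothetical, and the slight modulo bias is assumed not to matter for this purpose.

```sql
-- Integer-only variant of the random ID expression (an illustration,
-- not the original author's formula).
-- & 2147483647 clears the sign bit, giving a value in 0..2147483647;
-- % 200 reduces it to 0..199; adding 1 shifts it to 1..200.
SELECT
    1 + (CAST(CRYPT_GEN_RANDOM(4) AS int) & 2147483647) % 200 AS RandomID;
```

Since the result is already an int between 1 and 200, it should be usable in place of the rnd expression in CTE_Users without changing the rest of the query. As far as I can tell, it also sidesteps a theoretical edge case in the decimal version, which truncates to 0 when the random integer happens to be exactly -2147483648, leaving that one row without a match.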