Find if two datasets are identical, one in .NET and one in SQL Server
I have to find if two datasets are identical without using a primary key, only the data.
If it exists I have to return the primary key, otherwise I need to insert a new row and return the new primary key
Instead of comparing each field separately, since I think it might be slow ????, I came up with this solution
(I'll take the address as an example)
On the SQL Server side persisted
computed column
with index
on it;
-- nice new function in 2012; concat
hashbytes('SHA2_256',
concat(
nchar(0x2460),[Address_Type_Id],
nchar(0x2461),[Street_Name],
nchar(0x2462),[Municipality],
nchar(0x2463),[Postal_Zip_Code],
nchar(0x2464),[Unit],
nchar(0x2465),[Street_Number],
nchar(0x2466),[PO_Box],
nchar(0x2467),[Rural_Route_Number],
nchar(0x2468),[Street_Type_Id],
nchar(0x2469),[Country_Type_Id],
nchar(0x246A),[Country_Division_Type_Id],
nchar(0x246B),[Country_Division_Other],
nchar(0x246C),[Direction_Type_Id]
)
)
on the .NET side;
using(var mySHA256 = System.Security.Cryptography.SHA256Managed.Create())
{
var Value = string.Concat(
(char)0x2460, PartyAddress.AddressTypeId,
(char)0x2461, PartyAddress.StreetName,
(char)0x2462, PartyAddress.Municipality,
(char)0x2463, PartyAddress.PostalZip,
(char)0x2464, PartyAddress.Unit,
(char)0x2465, PartyAddress.StreetNumber,
(char)0x2466, PartyAddress.POBOX,
(char)0x2467, PartyAddress.RuralRouteNumber,
(char)0x2468, PartyAddress.StreetTypeId,
(char)0x2469, PartyAddress.CountryTypeId,
(char)0x246A, PartyAddress.CountryDivisionTypeId,
(char)0x246B, PartyAddress.CountryDivisionOther,
(char)0x246C, PartyAddress.DirectionTypeId);
var hashValue = mySHA256.ComputeHash(Encoding.Unicode.GetBytes(Value));
}
After that, I can now compare hashValue
and computed column
and it actually works.
My question is, before I start implementing this in MANY other tables, should I be looking at a different, better solution?
EDIT **************
to add another solution, comparing fields by field, since I am using entity framework would be like this
var addressExist = Tombstone.Addresses
.FirstOrDefault(x =>
x.Address_Type_Id == PartyAddress.AddressTypeId &&
x.Street_Name == PartyAddress.StreetName &&
x.Municipality == PartyAddress.Municipality &&
x.Postal_Zip_Code == PartyAddress.PostalZip &&
x.Unit == PartyAddress.Unit &&
x.Street_Number == PartyAddress.StreetNumber &&
x.PO_Box == PartyAddress.POBOX &&
x.Rural_Route_Number == PartyAddress.RuralRouteNumber &&
x.Street_Type_Id == PartyAddress.StreetTypeId &&
x.Country_Type_Id == PartyAddress.CountryTypeId &&
x.Country_Division_Type_Id == PartyAddress.CountryDivisionTypeId &&
x.Country_Division_Other == PartyAddress.CountryDivisionOther &&
x.Direction_Type_Id == PartyAddress.DirectionTypeId);
which generate this request
SELECT TOP (1) [Extent1].[Address_Id] AS [Address_Id]
FROM [dbo].[Address] AS [Extent1]
WHERE ([Extent1].[Address_Type_Id] = @p__linq__0) AND
(([Extent1].[Street_Name] = @p__linq__1) OR (([Extent1].[Street_Name] IS NULL) AND (@p__linq__1 IS NULL))) AND
(([Extent1].[Municipality] = @p__linq__2) OR (([Extent1].[Municipality] IS NULL) AND (@p__linq__2 IS NULL))) AND
(([Extent1].[Postal_Zip_Code] = @p__linq__3) OR (([Extent1].[Postal_Zip_Code] IS NULL) AND (@p__linq__3 IS NULL))) AND
(([Extent1].[Unit] = @p__linq__4) OR (([Extent1].[Unit] IS NULL) AND (@p__linq__4 IS NULL))) AND
(([Extent1].[Street_Number] = @p__linq__5) OR (([Extent1].[Street_Number] IS NULL) AND (@p__linq__5 IS NULL))) AND
(([Extent1].[PO_Box] = @p__linq__6) OR (([Extent1].[PO_Box] IS NULL) AND (@p__linq__6 IS NULL))) AND
(([Extent1].[Rural_Route_Number] = @p__linq__7) OR (([Extent1].[Rural_Route_Number] IS NULL) AND (@p__linq__7 IS NULL))) AND
(([Extent1].[Street_Type_Id] = @p__linq__8) OR (([Extent1].[Street_Type_Id] IS NULL) AND (@p__linq__8 IS NULL))) AND
(([Extent1].[Country_Type_Id] = @p__linq__9) OR (([Extent1].[Country_Type_Id] IS NULL) AND (@p__linq__9 IS NULL))) AND
(([Extent1].[Country_Division_Type_Id] = @p__linq__10) OR (([Extent1].[Country_Division_Type_Id] IS NULL) AND (@p__linq__10 IS NULL))) AND
(([Extent1].[Country_Division_Other] = @p__linq__11) OR (([Extent1].[Country_Division_Other] IS NULL) AND (@p__linq__11 IS NULL))) AND
(([Extent1].[Direction_Type_Id] = @p__linq__12) OR (([Extent1].[Direction_Type_Id] IS NULL) AND (@p__linq__12 IS NULL)))
source to share
If you think that comparing each field will be slow, consider an alternative solution consisting of:
- take each field and add char
- take each feed and concatenate with the other, repeating this process n times (where there are no fields)
- calculating the SHA256 hash
So you've just done a 2n to 2 x operation (2n + 2n + hash (n)). Also, you create a lot of strings, each adding CPU cycles and memory (strings are immutable in .Net).
You can clearly see which one will be slower.
Update:
- Since the hash is computed and stored, then you don't get 2n of overhead, but only processing on the .Net side.
- StringBuilder doesn't magically give you savings, especially for such a small number of operations.
- Having said that, hashing will probably be faster if you are indexing the hash value on SQL Server. Since you require a generic solution, the difference in hash performance and access to all of these columns can vary. I would suggest running some tests against the worst case scenario to see which one performs better.
- One caveat about adding char - now you are modifying data. If you can guarantee that the char app you add will never happen in real data, you're fine. Otherwise, you yourself enter the collision probability.
- Another idea is that you can eliminate the comparison of NULL values ββsince you already know they are identical.
source to share
If I didn't have to automate this (i.e. you only do this from time to time, and manual comparisons are good enough), then I would consider using a specialized SQL database comparison tool. Then I loaded the .NET DataTable
into a temporary SQL Server database, name them exactly the same as the SQL Server tables they correspond to and ask the tool to compare the temporary DB with the actual DB.
Also, calculating a checksum for each row seems like a reasonable approach.
(Alternative: Not that I found the following idea a good approach, and I never tested it, but in theory you could also define a constraint UNIQUE
on all table columns and then try to insert data from each of DataRow
yours DataTable
into it. If it INSERT
succeeds, then the SQL table Server does not have this record, i.e. your two tables are not identical. They are identical only if each makes a unique constraint and each INSERT
fails, and if both tables DataTable
and SQL Server have the same number of rows.)
source to share