Find if two datasets are identical, one in .NET and one in SQL Server

I have to find if two datasets are identical without using a primary key, only the data.

If it exists I have to return the primary key, otherwise I need to insert a new row and return the new primary key

Instead of comparing each field separately, since I think it might be slow ????, I came up with this solution

(I'll take the address as an example)

On the SQL Server side persisted

computed column

with index

on it;

-- nice new function in 2012; concat
hashbytes('SHA2_256',
    concat(
        nchar(0x2460),[Address_Type_Id],
        nchar(0x2461),[Street_Name],
        nchar(0x2462),[Municipality],
        nchar(0x2463),[Postal_Zip_Code],
        nchar(0x2464),[Unit],
        nchar(0x2465),[Street_Number],
        nchar(0x2466),[PO_Box],
        nchar(0x2467),[Rural_Route_Number],
        nchar(0x2468),[Street_Type_Id],
        nchar(0x2469),[Country_Type_Id],
        nchar(0x246A),[Country_Division_Type_Id],
        nchar(0x246B),[Country_Division_Other],
        nchar(0x246C),[Direction_Type_Id]
    )
)

      

on the .NET side;

using(var mySHA256 = System.Security.Cryptography.SHA256Managed.Create())
{
    var Value = string.Concat(
            (char)0x2460, PartyAddress.AddressTypeId,
            (char)0x2461, PartyAddress.StreetName,
            (char)0x2462, PartyAddress.Municipality,
            (char)0x2463, PartyAddress.PostalZip,
            (char)0x2464, PartyAddress.Unit,
            (char)0x2465, PartyAddress.StreetNumber,
            (char)0x2466, PartyAddress.POBOX,
            (char)0x2467, PartyAddress.RuralRouteNumber,
            (char)0x2468, PartyAddress.StreetTypeId,
            (char)0x2469, PartyAddress.CountryTypeId,
            (char)0x246A, PartyAddress.CountryDivisionTypeId,
            (char)0x246B, PartyAddress.CountryDivisionOther,
            (char)0x246C, PartyAddress.DirectionTypeId);

    var hashValue = mySHA256.ComputeHash(Encoding.Unicode.GetBytes(Value));
}

      

After that, I can now compare hashValue

and computed column

and it actually works.

My question is, before I start implementing this in MANY other tables, should I be looking at a different, better solution?

EDIT **************

to add another solution, comparing fields by field, since I am using entity framework would be like this

var addressExist = Tombstone.Addresses
                            .FirstOrDefault(x => 
        x.Address_Type_Id == PartyAddress.AddressTypeId &&
        x.Street_Name == PartyAddress.StreetName &&
        x.Municipality == PartyAddress.Municipality &&
        x.Postal_Zip_Code == PartyAddress.PostalZip &&
        x.Unit == PartyAddress.Unit &&
        x.Street_Number == PartyAddress.StreetNumber &&
        x.PO_Box == PartyAddress.POBOX &&
        x.Rural_Route_Number == PartyAddress.RuralRouteNumber &&
        x.Street_Type_Id == PartyAddress.StreetTypeId &&
        x.Country_Type_Id == PartyAddress.CountryTypeId &&
        x.Country_Division_Type_Id == PartyAddress.CountryDivisionTypeId &&
        x.Country_Division_Other == PartyAddress.CountryDivisionOther &&
        x.Direction_Type_Id == PartyAddress.DirectionTypeId);

      

which generate this request

SELECT TOP (1) [Extent1].[Address_Id] AS [Address_Id]
    FROM [dbo].[Address] AS [Extent1]
    WHERE ([Extent1].[Address_Type_Id] = @p__linq__0) AND 
          (([Extent1].[Street_Name] = @p__linq__1) OR (([Extent1].[Street_Name] IS NULL) AND (@p__linq__1 IS NULL))) AND 
          (([Extent1].[Municipality] = @p__linq__2) OR (([Extent1].[Municipality] IS NULL) AND (@p__linq__2 IS NULL))) AND 
          (([Extent1].[Postal_Zip_Code] = @p__linq__3) OR (([Extent1].[Postal_Zip_Code] IS NULL) AND (@p__linq__3 IS NULL))) AND 
          (([Extent1].[Unit] = @p__linq__4) OR (([Extent1].[Unit] IS NULL) AND (@p__linq__4 IS NULL))) AND 
          (([Extent1].[Street_Number] = @p__linq__5) OR (([Extent1].[Street_Number] IS NULL) AND (@p__linq__5 IS NULL))) AND 
          (([Extent1].[PO_Box] = @p__linq__6) OR (([Extent1].[PO_Box] IS NULL) AND (@p__linq__6 IS NULL))) AND 
          (([Extent1].[Rural_Route_Number] = @p__linq__7) OR (([Extent1].[Rural_Route_Number] IS NULL) AND (@p__linq__7 IS NULL))) AND
          (([Extent1].[Street_Type_Id] = @p__linq__8) OR (([Extent1].[Street_Type_Id] IS NULL) AND (@p__linq__8 IS NULL))) AND 
          (([Extent1].[Country_Type_Id] = @p__linq__9) OR (([Extent1].[Country_Type_Id] IS NULL) AND (@p__linq__9 IS NULL))) AND 
          (([Extent1].[Country_Division_Type_Id] = @p__linq__10) OR (([Extent1].[Country_Division_Type_Id] IS NULL) AND (@p__linq__10 IS NULL))) AND 
          (([Extent1].[Country_Division_Other] = @p__linq__11) OR (([Extent1].[Country_Division_Other] IS NULL) AND (@p__linq__11 IS NULL))) AND 
          (([Extent1].[Direction_Type_Id] = @p__linq__12) OR (([Extent1].[Direction_Type_Id] IS NULL) AND (@p__linq__12 IS NULL)))

      

+3


source to share


2 answers


If you think that comparing each field will be slow, consider an alternative solution consisting of:

  • take each field and add char
  • take each feed and concatenate with the other, repeating this process n times (where there are no fields)
  • calculating the SHA256 hash

So you've just done a 2n to 2 x operation (2n + 2n + hash (n)). Also, you create a lot of strings, each adding CPU cycles and memory (strings are immutable in .Net).



You can clearly see which one will be slower.

Update:

  • Since the hash is computed and stored, then you don't get 2n of overhead, but only processing on the .Net side.
  • StringBuilder doesn't magically give you savings, especially for such a small number of operations.
  • Having said that, hashing will probably be faster if you are indexing the hash value on SQL Server. Since you require a generic solution, the difference in hash performance and access to all of these columns can vary. I would suggest running some tests against the worst case scenario to see which one performs better.
  • One caveat about adding char - now you are modifying data. If you can guarantee that the char app you add will never happen in real data, you're fine. Otherwise, you yourself enter the collision probability.
  • Another idea is that you can eliminate the comparison of NULL values ​​since you already know they are identical.
+2


source


If I didn't have to automate this (i.e. you only do this from time to time, and manual comparisons are good enough), then I would consider using a specialized SQL database comparison tool. Then I loaded the .NET DataTable

into a temporary SQL Server database, name them exactly the same as the SQL Server tables they correspond to and ask the tool to compare the temporary DB with the actual DB.

Also, calculating a checksum for each row seems like a reasonable approach.



(Alternative: Not that I found the following idea a good approach, and I never tested it, but in theory you could also define a constraint UNIQUE

on all table columns and then try to insert data from each of DataRow

yours DataTable

into it. If it INSERT

succeeds, then the SQL table Server does not have this record, i.e. your two tables are not identical. They are identical only if each makes a unique constraint and each INSERT

fails, and if both tables DataTable

and SQL Server have the same number of rows.)

0


source







All Articles