Most efficient way to compute a hash or checksum for a row with many columns?

I have a scenario where I need to check whether rows in a target database need to be updated from a source database. The source is actually a view, and the data from that view is accumulated into the target table. Since the view combines and collapses data from multiple base tables, we don't have a good way to change the schema to support change tracking, so I thought about calculating a hash of each row of data and including it as part of the view. We could then compare it against the hash value stored in the destination table to see if there is a difference and update accordingly.

I'm aware of the following functions:

CHECKSUM
BINARY_CHECKSUM
HASHBYTES

Either CHECKSUM() or BINARY_CHECKSUM() seems like the best option, but I'm not sure how well either would perform on a 50-column, million-plus-row view. I also know that the generated checksums/hashes can collide, but that's acceptable in this case.
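
For illustration, here is a minimal sketch of what the two approaches look like, using hypothetical names (dbo.SourceView, KeyCol, and Col1..Col3 standing in for the 50 columns). Note that CHECKSUM/BINARY_CHECKSUM take a column list directly, while HASHBYTES takes a single string/binary argument, and on SQL Server 2005 its input is capped at 8,000 bytes:

    -- Hypothetical names (dbo.SourceView, KeyCol, Col1..Col3) for illustration.

    -- BINARY_CHECKSUM: takes a column list directly, but produces only a
    -- 32-bit value and is known to collide.
    SELECT KeyCol,
           BINARY_CHECKSUM(Col1, Col2, Col3 /* ... remaining columns ... */) AS RowChecksum
    FROM dbo.SourceView;

    -- HASHBYTES: a stronger hash, but the columns must be cast to one string,
    -- delimited, and NULL-handled explicitly. Input is limited to 8,000 bytes
    -- on SQL Server 2005/2008 R2.
    SELECT KeyCol,
           HASHBYTES('SHA1',
               ISNULL(CAST(Col1 AS NVARCHAR(100)), N'') + N'|' +
               ISNULL(CAST(Col2 AS NVARCHAR(100)), N'') + N'|' +
               ISNULL(CAST(Col3 AS NVARCHAR(100)), N'')) AS RowHash
    FROM dbo.SourceView;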

So the question is: is a hash/checksum the right approach here, and if so, which function is better to use? Or is there an entirely different, better way to approach the problem?

(Oh, we're on SQL Server 2005 now, but will be moving to 2008 R2 soon, if that helps.)
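
For completeness, here is a sketch of the compare-and-update step described above, assuming the view exposes a RowHash column as sketched earlier and the target table stores the hash from the previous load (names are again hypothetical):

    -- Update only the rows whose stored hash no longer matches the source.
    UPDATE t
    SET    t.Col1    = s.Col1,
           t.Col2    = s.Col2,  -- ... remaining columns ...
           t.RowHash = s.RowHash
    FROM   dbo.TargetTable AS t
    INNER JOIN dbo.SourceView AS s
            ON s.KeyCol = t.KeyCol
    WHERE  t.RowHash <> s.RowHash;

    -- Insert rows that don't exist at the destination yet.
    INSERT dbo.TargetTable (KeyCol, Col1, Col2, RowHash)
    SELECT s.KeyCol, s.Col1, s.Col2, s.RowHash
    FROM   dbo.SourceView AS s
    WHERE  NOT EXISTS
           (SELECT 1 FROM dbo.TargetTable AS t WHERE t.KeyCol = s.KeyCol);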

1 answer


I don't know that I would really trust CHECKSUM. I have seen many cases where people have documented two different rows producing a collision. Do you just need to know that a row has changed (or doesn't exist yet at the destination)? Have you ruled out using ROWVERSION? Are you potentially updating data in both places?
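
For reference, here is a minimal sketch of the ROWVERSION idea against a single hypothetical base table (your view spans several, so each would need its own column). The engine bumps the value automatically on every insert and update, so changed rows can be found by comparing against the highest version captured during the previous sync:

    -- Hypothetical base table; the ROWVERSION value is maintained by the engine.
    ALTER TABLE dbo.SomeBaseTable ADD RowVer ROWVERSION;

    -- Persist the highest version seen after each sync, then next time:
    DECLARE @LastSyncVersion BINARY(8);
    SET @LastSyncVersion = 0x0000000000000000;  -- loaded from sync-state storage in practice

    SELECT *
    FROM dbo.SomeBaseTable
    WHERE RowVer > @LastSyncVersion;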



Since you are moving to SQL Server 2008 R2 soon, have you considered the mechanisms that already exist there, such as Change Tracking or Change Data Capture? (Comparison here.) There are also ways to deal with this problem that don't involve tracking which rows have changed, though that depends on your end goal. On an old system I worked on, we would build the data changes in a separate schema and then play switcheroo when the data was ready. Of course, all the data was updated at the source, and it was okay for the destination to be a few minutes behind. But this prevented delta problems between source and destination.
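
If you do go the Change Tracking route on 2008 R2, a minimal sketch looks something like this, with hypothetical database/table names; each base table you care about needs a primary key and must be enabled individually:

    -- Enable Change Tracking at the database and table level.
    ALTER DATABASE MyDb
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.SomeBaseTable
        ENABLE CHANGE_TRACKING;

    -- On each sync, ask for everything changed since the version saved last time.
    DECLARE @LastSyncVersion BIGINT;
    SET @LastSyncVersion = 0;  -- persisted from the previous run in practice

    SELECT c.Id, c.SYS_CHANGE_OPERATION, t.*
    FROM CHANGETABLE(CHANGES dbo.SomeBaseTable, @LastSyncVersion) AS c
    LEFT JOIN dbo.SomeBaseTable AS t ON t.Id = c.Id;  -- LEFT JOIN: deleted rows have no match

    -- Save this for the next run:
    SELECT CHANGE_TRACKING_CURRENT_VERSION();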
