C# / SQL Server 2016 - Update approx. 100 million records with a hash value

I need to be able to read about 100 million records from a SQL Server 2016 database table and then generate hashes for one or more columns and write those records back to the table.

So far I have tried a couple of solutions, but they are too slow for our needs. I am testing this on a Dell XPS 15 with an i7-7700HQ processor and 32 GB of RAM. First I tried the T-SQL HASHBYTES() function with the SHA1 hash, but it took over two hours on the 100 million row test dataset.

Updating via a C# OleDbReader was even slower, and there the bottleneck seemed to be writing the records back. I have now switched to using SqlBulkCopy to copy the changed records into a new temp table, which is much faster than updating the existing table. But it still takes 40 minutes to generate all the hashes, plus several more minutes to write the records back. Ideally, we would like the entire operation to complete within an hour. Does anyone have an idea where I could optimize further? Here is the code:

// Requires: using System.Data; using System.Data.SqlClient; using System.Text;
using (SqlConnection sourceConnection = new SqlConnection(connectionString))
{
    sourceConnection.Open();

    // Read the entire source table into a DataTable.
    SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM [ContosoRetailDW].[dbo].[FactInventory]", sourceConnection);

    var dataTable = new System.Data.DataTable();

    adapter.FillSchema(dataTable, System.Data.SchemaType.Source);

    // Column 2 will hold the hash as a hex string.
    dataTable.Columns[2].DataType = typeof(string);
    dataTable.Columns[2].MaxLength = 20;

    adapter.Fill(dataTable);

    // Hash column 2 of every row (single-threaded).
    for (int i = 0; i < dataTable.Rows.Count; i++)
    {
        byte[] toHash = Encoding.UTF8.GetBytes((string)dataTable.Rows[i][2]);
        dataTable.Rows[i][2] = xxHash.CalculateHash(toHash).ToString("X");
    }

    using (SqlConnection destinationConnection = new SqlConnection(connectionString))
    {
        destinationConnection.Open();

        // Reuse the already-open connection instead of opening a second
        // one from the connection string.
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(destinationConnection))
        {
            bulkCopy.BatchSize = 100000;
            bulkCopy.DestinationTableName = "FactInventory_Hashed";
            bulkCopy.WriteToServer(dataTable);
        }
    }
}

I have already tried playing with the bulk copy batch size, and I am using a very fast hashing algorithm.

1 answer


This is more of an extended comment.

From my point of view, this approach has the following disadvantages:

[1] It moves a large amount of data out of SQL Server and back (SQL DB -> C# app -> SQL DB), which generates a lot of network I/O (it may be local rather than network I/O, true, but it is still a lot of I/O).

[2] It loads a large amount of data into memory. If this application runs on the same machine as the SQL Server Database Engine, it can create memory pressure for the other applications running there (including SQL Server itself).

[3] It makes two passes over a presumably large data set: the first is the SELECT ... FROM ..., and the second is the for (...) {...} loop.

What would I try?

(i) I would import the xxHash.CalculateHash method as a CLR scalar function (see https://docs.microsoft.com/en-us/sql/relational-databases/clr-integration-database-objects-user-defined-functions/clr-scalar-valued-functions):

CREATE ASSEMBLY CLRxxHash FROM 'D:\....\xxHash.dll';  
GO  

CREATE FUNCTION dbo.xxHash(@Source VARBINARY(8000)) RETURNS INT   
AS EXTERNAL NAME CLRxxHash.......;   
GO 

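For reference, the managed side of such a function might look roughly like this (a minimal sketch, not the actual assembly: the class and method names are hypothetical, and it assumes the question's xxHash.CalculateHash(byte[]) returns a 32-bit value):

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class HashFunctions
{
    // Hypothetical SQLCLR wrapper around the question's managed xxHash;
    // assumes xxHash.CalculateHash(byte[]) returns a 32-bit hash.
    [SqlFunction(IsDeterministic = true, IsPrecise = true,
                 DataAccess = DataAccessKind.None)]
    public static SqlInt32 CalculateXxHash(SqlBytes source)
    {
        if (source == null || source.IsNull)
            return SqlInt32.Null;

        return new SqlInt32(unchecked((int)xxHash.CalculateHash(source.Value)));
    }
}

Declaring DataAccess = DataAccessKind.None matters here: unlike T-SQL scalar UDFs, a CLR scalar function that performs no data access does not automatically force a serial plan.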

and then I would use the following T-SQL script:



USE ContosoRetailDW
GO

INSERT dbo.FactInventory_Hashed (..., HashCol)
SELECT ..., dbo.xxHash(CONVERT(VARCHAR(8000), fi.Column2)) AS HashCol
FROM dbo.FactInventory AS fi


Warning: this should perform well enough on a large table only if SQL Server uses parallelism. Because of the scalar function, there is a risk that SQL Server will generate a serial plan.

(ii) If the execution plan for the above INSERT ... SELECT statement is serial, I would use batches. One connection would process all rows with IDs between 1 and 1,000,000:

INSERT dbo.FactInventory_Hashed (..., HashCol)
SELECT ..., dbo.xxHash(CONVERT(VARCHAR(8000), fi.Column2)) AS HashCol
FROM dbo.FactInventory AS fi
WHERE fi.ID BETWEEN 1 AND 1000000


another connection would handle all rows with IDs between 1,000,001 and 2,000,000:

INSERT dbo.FactInventory_Hashed (..., HashCol)
SELECT ..., dbo.xxHash(CONVERT(VARCHAR(8000), fi.Column2)) AS HashCol
FROM dbo.FactInventory AS fi
WHERE fi.ID BETWEEN 1000001 AND 2000000


etc.
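To run those range batches from the client side in parallel, a minimal C# sketch could look like this (the connection string, table size, degree of parallelism, and the ID column are assumptions, and the elided column lists from the answer are left as "..."):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

class HashBatchRunner
{
    static void Main()
    {
        const string connectionString = "...";   // placeholder
        const long totalRows = 100_000_000;      // assumed table size
        const long batchSize = 1_000_000;

        // Build the list of ID ranges, one per batch.
        var ranges = new List<(long From, long To)>();
        for (long from = 1; from <= totalRows; from += batchSize)
            ranges.Add((from, Math.Min(from + batchSize - 1, totalRows)));

        // One connection per batch, a few batches at a time.
        Parallel.ForEach(ranges,
            new ParallelOptions { MaxDegreeOfParallelism = 8 },
            range =>
            {
                using (var connection = new SqlConnection(connectionString))
                using (var command = new SqlCommand(@"
                    INSERT dbo.FactInventory_Hashed (..., HashCol)
                    SELECT ..., dbo.xxHash(CONVERT(VARCHAR(8000), fi.Column2)) AS HashCol
                    FROM dbo.FactInventory AS fi
                    WHERE fi.ID BETWEEN @From AND @To;", connection))
                {
                    connection.Open();
                    command.CommandTimeout = 0;  // long-running batch
                    command.Parameters.AddWithValue("@From", range.From);
                    command.Parameters.AddWithValue("@To", range.To);
                    command.ExecuteNonQuery();   // fill in the "..." column lists first
                }
            });
    }
}

Keeping MaxDegreeOfParallelism modest avoids saturating the server, and since each INSERT auto-commits as its own statement, the per-batch transaction log usage stays bounded.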
