Does the number of fields in a table affect performance even if they are not referenced?
I am reading and parsing CSV files into a SQL Server 2008 database. This process uses a common CSV parser for all files.
The CSV parser puts the parsed fields into a generic field import table (F001 VARCHAR(MAX) NULL, F002 VARCHAR(MAX) NULL, Fnnn ...). The data is then moved to the real tables by SQL code that knows which parsed field (Fnnn) goes to which field in the destination table. So once the data is in the table, only the fields being copied are referenced. Some files can be quite large (a million rows).
The question is: does the number of fields in a table affect performance or memory usage, even if most of the fields are never referenced? The only operations performed on the field import tables are an INSERT and then a SELECT to move the data into another table; there is no JOIN or WHERE on the field data.
I currently have three field import tables: one with 20 fields, one with 50 fields, and one with 100 fields (the maximum number of fields I have encountered so far). There is currently logic to use the smallest table that fits the file.
I would like to make this process more generic and have a single table with 1000 fields (I am aware of the 1024-column limit). And yes, some of the files planned for processing (from third parties) will be in the 900-1000 field range.
Most files will have fewer than 50 fields.
At this point, dealing with the existing three field import tables (plus planned tables for more fields (200, 500, 1000?)) is becoming a logistical nightmare in the code, and working with a single table would solve a lot of problems, provided I don't give up much performance.
As was correctly pointed out in the comments, even if your table has 1000 columns, as long as most of them are NULL this shouldn't have a big performance impact, since NULLs won't waste a lot of space.
You mentioned that you can have real data with 900-1000 non-NULL columns. If you plan to import such files, you may run into another SQL Server limitation. Yes, the maximum number of columns in a table is 1024, but there is also a limit of 8060 bytes per row. If your columns are varchar(max), then each such column will consume 24 bytes out of the 8060 in the actual row, and the rest of the data will be pushed off-row:
SQL Server supports row-overflow storage, which enables variable length columns to be pushed off-row. Only a 24-byte root is stored in the main record for variable length columns pushed out of row; because of this, the effective row limit is higher than in previous releases of SQL Server. For more information, see the "Row-Overflow Data Exceeding 8 KB" topic in SQL Server Books Online.
So in practice you can only have a table with 8060 / 24 = 335 non-NULL varchar(max) columns. (Strictly speaking, even slightly fewer, because the row carries other overhead as well.)
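The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a rough illustration: the two constants are the figures quoted above, while the function name and the `other_overhead_bytes` parameter are made up for the example, and the overhead value used in the second call is an arbitrary assumption rather than a measured number.

```python
ROW_LIMIT_BYTES = 8060      # in-row size limit quoted above
OFF_ROW_POINTER_BYTES = 24  # root left in the row for each value pushed off-row


def max_off_row_columns(other_overhead_bytes: int = 0) -> int:
    """Upper bound on columns whose values are all pushed off-row.

    other_overhead_bytes is a hypothetical allowance for row headers and
    similar bookkeeping; the default of 0 reproduces the simple division.
    """
    usable = ROW_LIMIT_BYTES - other_overhead_bytes
    return max(usable // OFF_ROW_POINTER_BYTES, 0)


if __name__ == "__main__":
    print(max_off_row_columns())     # plain 8060 / 24 estimate, rounded down
    print(max_off_row_columns(200))  # same estimate with an assumed 200 bytes of overhead
```

Whatever the exact overhead turns out to be, the bound stays far below the 900-1000 populated columns mentioned in the question.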
There are also so-called wide tables, which can contain up to 30,000 columns, but the maximum size of a wide table row is 8019 bytes. So they will not help you in this case.
First, to answer the question as stated:
Does the number of fields in a table affect performance even if not referenced?
- If the fields are fixed-length (*INT, *MONEY, DATE / TIME / DATETIME / etc., UNIQUEIDENTIFIER, etc.), and the field is not marked as SPARSE and compression is not enabled (both features were introduced in SQL Server 2008), then the full size of the field is taken up (even if NULL), and this affects performance even if the fields are not in the SELECT list.
- If the fields are variable-length and NULL (or empty), then they just take up a small amount of per-row overhead.
- In terms of space in general, is this table a heap (no clustered index) or clustered? And how do you clear the table for each new import? If it is a heap and you are just doing a DELETE, then it might not get rid of all the unused pages. You would know there is a problem if the table still shows space in use, even with 0 rows, when you run sp_spaceused. Suggestions 2 and 3 below would naturally not have this problem.
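To put a rough number on the cost of unused variable-length columns, here is a small Python sketch of the per-row byte count. It is not taken from this answer: it assumes the commonly documented SQL Server row layout (a 4-byte header, a 2-byte column count, a NULL bitmap of one bit per column, and a 2-byte offset per variable-length column up to the last non-NULL one), it assumes every value is short enough to stay in-row, and the function name is hypothetical. Treat the results as approximate.

```python
import math
from typing import Optional, Sequence


def estimate_row_bytes(total_columns: int,
                       value_lengths: Sequence[Optional[int]]) -> int:
    """Approximate in-row bytes for a row of nullable variable-length columns.

    value_lengths holds the byte length of each leading column's value,
    or None for NULL; columns beyond the end of the list count as NULL.
    """
    if len(value_lengths) > total_columns:
        raise ValueError("more values than columns")

    header = 4                                  # status bytes + fixed-length offset
    column_count = 2                            # number of columns in the row
    null_bitmap = math.ceil(total_columns / 8)  # one bit per column

    # Offsets are only kept up to the last non-NULL variable-length column.
    last_used = 0
    for position, length in enumerate(value_lengths, start=1):
        if length is not None:
            last_used = position
    var_section = (2 + 2 * last_used) if last_used else 0

    data = sum(length for length in value_lengths if length is not None)
    return header + column_count + null_bitmap + var_section + data


if __name__ == "__main__":
    values = [10] * 50  # a 50-field file with 10-byte values
    narrow = estimate_row_bytes(100, values)
    wide = estimate_row_bytes(1000, values)
    print(narrow, wide, wide - narrow)
```

Under these assumptions, the only difference between loading the same 50-field file into a 100-column table and a 1000-column table is the larger NULL bitmap (125 bytes instead of 13 per row).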
Now, some ideas:
1. Have you considered using SSIS to handle this dynamically?
2. Since you seem to have a single-threaded process, why not create a global temporary table at the start of the process each time? Or, drop and re-create a real table in tempdb? Either way, if you know the destination, you can even dynamically create this import table with the destination field names and data types. Even if the CSV importer is not aware of the destination, at the beginning of the process you can call a proc that does know the destination and can create the "temp" table. The importer can then still import generically into a standard table name with no fields specified, and it will not error as long as the fields in the table are NULLable and there are at least as many of them as there are columns in the file.
3. Does the incoming CSV data have embedded line breaks, quotes, and/or delimiters? Do you manipulate the data between the staging table and the destination table? It might be possible to import dynamically straight into the destination table, with the proper data types but no in-transit manipulation. Another option is doing this in SQLCLR. You can write a stored procedure that opens the file and spits out the split fields while doing an INSERT INTO...EXEC. Or, if you don't want to write your own, have a look at the SQL# SQLCLR library, specifically the File_SplitIntoFields stored procedure. This proc is only available in the Full / paid-for version, and I am the creator of SQL#, but it does seem well suited to this situation.
4. Considering that:
   - all fields are imported as text
   - the names and types of the target fields are known
   - the number of fields differs between target tables
   what about a single XML field, importing each line as a single-level document with each field being <F001>, <F002>, etc.? By doing this, you don't have to worry about the number of fields or carry any fields that are not being used. And in fact, since the names of the target fields are known to this process, you can even use those names for the elements in the XML document for each line. So the rows might look like this:

   ID  LoadFileID  ImportLine
   1   1           <row><FirstName>Bob</FirstName><LastName>Villa</LastName></row>
   2   1           <row><Number>555-555-5555</Number><Type>Cell</Type></row>

   Yes, the data itself will take up more space than the current VARCHAR(MAX) fields, both because XML is double-byte and because of the inherent bulk of the element tags. But then you are not locked into any physical structure. And just looking at the data will make it easier to spot problems, since you will be looking at real field names instead of F001, F002, etc.
5. As far as at least speeding up the process of reading the file, splitting the fields, and inserting, you should use Table-Valued Parameters (TVPs) to stream the data into the import table. I have several answers here showing different implementations of the method, differing mainly in the data source (a file versus a collection already in memory, etc.).
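As a rough illustration of the single-XML-column idea above, the following Python sketch turns CSV lines into one <row> document per line. It is a minimal sketch under stated assumptions, not the process described in the question: the function name is hypothetical, the field names must be valid XML element names, and loading the resulting strings into the XML column is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Sequence


def rows_to_xml(csv_text: str,
                field_names: Optional[Sequence[str]] = None) -> Iterator[str]:
    """Yield one <row> document per CSV line, one child element per field.

    If field_names is omitted, generic names F001, F002, ... are used.
    Fields beyond the end of field_names are dropped.
    """
    for fields in csv.reader(io.StringIO(csv_text)):
        names = field_names or [f"F{i:03d}" for i in range(1, len(fields) + 1)]
        row = ET.Element("row")
        for name, value in zip(names, fields):
            # ElementTree escapes &, < and > in the text on serialization.
            ET.SubElement(row, name).text = value
        yield ET.tostring(row, encoding="unicode")


if __name__ == "__main__":
    for line in rows_to_xml("Bob,Villa\n", ["FirstName", "LastName"]):
        print(line)
    for line in rows_to_xml("555-555-5555,Cell\n"):
        print(line)
```

Using the csv module rather than a plain split keeps quoted fields with embedded delimiters intact, which matters for the question raised in idea 3.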