How can I improve the performance of DataSet.ReadXml if I am using a schema?
I have an ADO DataSet that I load from its XML file via ReadXml. Data and schema are in separate files.
It currently takes about 13 seconds to load this dataset. I can cut this down to 700 milliseconds unless I read the schema of the DataSet and let ReadXml output the schema, but then the resulting DataSet is free of any constraints.
I've tried doing this:
Console.WriteLine("Reading dataset with external schema.");
ds.ReadXmlSchema(xsdPath);
Console.WriteLine("Reading the schema took {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
dt.BeginLoadData();
}
ds.ReadXml(xmlPath);
Console.WriteLine("ReadXml completed after {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
dt.EndLoadData();
}
Console.WriteLine("Process complete at {0} milliseconds.", sw.ElapsedMilliseconds);
When I do this, reading the schema takes 27ms and reading the DataSet takes 12000+ milliseconds. And this is the time before I call EndLoadData on all data tables.
It's not a huge amount of data - it's about 1.5mb, no nested relationships, and all tables contain two or three columns of 6-30 characters. The only thing I can figure out which is different if I read the diagram in front is that the diagram includes all the unique constraints. But BeginLoadData should turn off restrictions (as well as change notification, etc.). So it shouldn't apply here. (And yes, I tried just setting EnforceConstraints to false.)
I have read a lot of posts about how people improve the load times of DataSets by reading the schema first, rather than outputting the object to the schema. In my case, schema output makes the process about 20 times faster than an explicit schema.
It makes me a little crazy. This DataSet schema is generated outside of the meta information, and I am tempted to write a method that creates it programmatically and simply parses it using the XmlReader. But I would rather not do that.
What am I missing? What else can I do to improve the speed here?
source to share
It's not an answer, for sure (although it's better than nothing I've gotten so far), but after struggling with this issue for a long time, I found it is completely missing when my program is not running inside Visual Studio.
Something I didn't mention before, which is even more puzzling, is that when I loaded another (but relatively large) XML document into the DataSet, the program ran just fine. Now I'm wondering if one of my DataSets has some kind of meta information that Visual Studio checks at runtime and the other doesn't. I dont know.
source to share
I will try to give you a performance comparison between storing data in text files and xml files.
The first function creates two files: one file with 1,000,000 records in text form and one file with 1,000,000 (same data) records in xml. You should first notice the difference in file size: ~ 64 MB (plain text) versus ~ 102 MB (xml file).
void create_files()
{
//create text file with data
StreamWriter sr = new StreamWriter("plain_text.txt");
for(int i=0;i<1000000;i++)
{
sr.WriteLine(i.ToString() + "<SEP>" + "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbb" + i.ToString());
}
sr.Flush();
sr.Close();
//create xml file with data
DataSet ds = new DataSet("DS1");
DataTable dt = new DataTable("T1");
DataColumn c1 = new DataColumn("c1", typeof(int));
DataColumn c2 = new DataColumn("c2", typeof(string));
dt.Columns.Add(c1);
dt.Columns.Add(c2);
ds.Tables.Add(dt);
DataRow dr;
for(int j=0; j< 1000000; j++)
{
dr = dt.NewRow();
dr[0]=j;
dr[1] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbb" + j.ToString();
dt.Rows.Add(dr);
}
ds.WriteXml("xml_text.xml");
}
The second function reads these two files: first it reads plain text in the dictionary (just to simulate real world usage), then it reads the XML file. Both stages are measured in milliseconds (and the results are written to the console):
Start reading a text file in memory
Text file loaded into memory in 7628 milliseconds
Start reading an XML file in memory
XML file loaded into memory in 21018 milliseconds
void read_files()
{
//timers
Stopwatch stw = new Stopwatch();
long milliseconds;
//read text file in a dictionary
Debug.WriteLine("Start read Text file into memory");
stw.Start();
milliseconds = 0;
StreamReader sr = new StreamReader("plain_text.txt");
Dictionary<int, string> dict = new Dictionary<int, string>(1000000);
string line;
string[] sep = new string[]{"<SEP>"};
string [] arValues;
while (sr.EndOfStream!=true)
{
line = sr.ReadLine();
arValues = line.Split(sep,StringSplitOptions.None);
dict.Add(Convert.ToInt32(arValues[0]),arValues[1]);
}
stw.Stop();
milliseconds = stw.ElapsedMilliseconds;
Debug.WriteLine("Text file loaded into memory in " + milliseconds.ToString() + " milliseconds" );
//create xml structure
DataSet ds = new DataSet("DS1");
DataTable dt = new DataTable("T1");
DataColumn c1 = new DataColumn("c1", typeof(int));
DataColumn c2 = new DataColumn("c2", typeof(string));
dt.Columns.Add(c1);
dt.Columns.Add(c2);
ds.Tables.Add(dt);
//read xml file
Debug.WriteLine("Start read XML file into memory");
stw.Restart();
milliseconds = 0;
ds.ReadXml("xml_text.xml");
stw.Stop();
milliseconds = stw.ElapsedMilliseconds;
Debug.WriteLine("XML file loaded into memory in " + milliseconds.ToString() + " milliseconds");
}
Conclusion: The XML file is almost double the size of the text file and loads three times slower than the text file.
XML processing is more convenient (due to the level of abstraction) than plain text, but more CPU / disk consumption.
So, if you have small files and acceptable performance, XML datasets are more than good. But if performance is what you want, I don't know if XML Data is installed (with whatever method is available) faster than regular text files. And basically, it starts from the very first reason: the XML file is larger because it has more tags.
source to share
Another step is to try to read a schema-less dataset and then Concatenate into a typed dataset that has limitations. This way, he has all the data at hand as he builds the indexes used to enforce constraints - perhaps that would be more efficient?
From MSDN :
Usually the Merge method is called at the end of a series of procedures that include checking for changes, reconciling errors, updating the source data with the changes, and finally Updating an existing DataSet.
...
source to share