Any efficient way to parse large text files and store the parsing information?
My goal is to parse text files and store information in appropriate tables.
I have to parse about 100 folders containing over 8,000 files, about 20 GB in total. When I tried to store the contents of an entire file in a single string, an out-of-memory exception was thrown:
using (StreamReader objStream = new StreamReader(filename))
{
    string fileDetails = objStream.ReadToEnd();
}
Hence, I tried the following logic:
string fileDetails = string.Empty;
string firstLine;
int lineCount = 0;

using (StreamReader objStream = new StreamReader(filename))
{
    // Getting the total number of lines in the file
    int fileLineCount = File.ReadLines(filename).Count();
    if (fileLineCount < 90000)
    {
        fileDetails = objStream.ReadToEnd();
        fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
        string[] fileInfo = fileDetails.ToString().Split('\n');
        // call respective method for parsing and insertion
    }
    else
    {
        while ((firstLine = objStream.ReadLine()) != null)
        {
            lineCount++;
            fileDetails = (fileDetails != string.Empty) ? string.Concat(fileDetails, "\n", firstLine)
                                                        : string.Concat(firstLine);
            if (lineCount == 90000)
            {
                fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
                string[] fileInfo = fileDetails.ToString().Split('\n');
                // call respective method for parsing and insertion
                fileDetails = string.Empty;
                lineCount = 0;
            }
        }
        // when the file has e.g. 90,057 lines, parse the remaining 57
        if (lineCount > 0)
        {
            string[] fileInfo = fileDetails.ToString().Split('\n');
            lineCount = 0;
            // call respective method for parsing and insertion
        }
    }
}
Here 90,000 is the batch size that is safe to handle without an out-of-memory exception in my case.
However, it takes more than two days to complete, and I noticed this is because of the line-by-line reading.
Is there a better approach for this?
Thanks in Advance :)
You can use a profiler to determine what is hurting your performance. In this case it is obvious: disk access and string concatenation.
- Don't read the file more than once. Let's take a look at your code. First of all, the line

  int fileLineCount = File.ReadLines(filename).Count();

  means that you are reading the entire file and discarding what you read. This is bad. Drop your if (fileLineCount < 90000) branch and keep only the else branch.
  It hardly matters whether you read line by line in sequential order or the whole file at once, because the reads are buffered anyway.
- Avoid string concatenation, especially for long strings.

  fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
  string[] fileInfo = fileDetails.ToString().Split('\n');

  This is really bad. You are already reading the file line by line, so why do this replace/split at all? File.ReadLines() gives you a sequence of all the lines. Just pass it to your parsing routine.
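As an illustration, here is a minimal sketch of that approach. The ParseAndInsert method and the root folder are made-up placeholders for your own parsing/insertion routine and your 100 folders:

using System.Collections.Generic;
using System.IO;

class Parser
{
    // Hypothetical stand-in for the "respective method for parsing and insertion"
    // mentioned in the question; replace it with your real routine.
    static void ParseAndInsert(IEnumerable<string> lines, string filename)
    {
        foreach (string line in lines)
        {
            // ... parse the line and buffer it for a bulk insert ...
        }
    }

    static void Main()
    {
        string root = @"C:\data"; // assumed root folder containing the 100 folders

        foreach (string file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            // File.ReadLines enumerates lazily: each file is read exactly once
            // and only one line is held in memory at a time.
            ParseAndInsert(File.ReadLines(file), file);
        }
    }
}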
If you do this correctly, I expect a significant speedup. It could be optimized further by reading the files on a separate thread and processing them on the main one, but that's another story.
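If you later want to experiment with the separate-thread idea, a rough sketch could look like this. Again, ParseAndInsert and the root folder are assumed placeholders; the 90,000-line batch size is taken from the question, and a bounded BlockingCollection keeps the reading thread from running too far ahead of the parser:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class PipelinedParser
{
    // Hypothetical stand-in for the parsing/insertion routine.
    static void ParseAndInsert(List<string> lines, string filename)
    {
        // ... parsing and insertion logic ...
    }

    static void Main()
    {
        const int BatchSize = 90000;   // batch size taken from the question
        string root = @"C:\data";      // assumed root folder

        // Bounded queue: at most 4 batches are buffered between the two threads.
        var queue = new BlockingCollection<(string File, List<string> Lines)>(boundedCapacity: 4);

        // Producer: reads files line by line on a background thread.
        Task reader = Task.Run(() =>
        {
            foreach (string file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
            {
                var batch = new List<string>(BatchSize);
                foreach (string line in File.ReadLines(file))
                {
                    batch.Add(line);
                    if (batch.Count == BatchSize)
                    {
                        queue.Add((file, batch));
                        batch = new List<string>(BatchSize);
                    }
                }
                if (batch.Count > 0)
                    queue.Add((file, batch)); // remaining lines, e.g. the last 57
            }
            queue.CompleteAdding();
        });

        // Consumer: parses and inserts on the main thread.
        foreach (var (file, lines) in queue.GetConsumingEnumerable())
        {
            ParseAndInsert(lines, file);
        }

        reader.Wait();
    }
}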