Is there an efficient way to parse large text files and store the parsed information?

My goal is to parse text files and store information in appropriate tables.

I have to parse about 100 folders with over 8,000 files, totaling about 20 GB. When I tried to store the contents of an entire file in a single string, an out-of-memory exception was thrown:

    using (StreamReader objStream = new StreamReader(filename))
    {
        string fileDetails = objStream.ReadToEnd();
    }

      

Hence, I tried the following logic instead:

    using (StreamReader objStream = new StreamReader(filename))
    {
        string fileDetails = string.Empty;
        string firstLine;
        int lineCount = 0;

        // Getting the total number of lines in the file
        int fileLineCount = File.ReadLines(filename).Count();

        if (fileLineCount < 90000)
        {
            fileDetails = objStream.ReadToEnd();
            fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
            string[] fileInfo = fileDetails.Split('\n');
            // call respective method for parsing and insertion
        }
        else
        {
            while ((firstLine = objStream.ReadLine()) != null)
            {
                lineCount++;
                fileDetails = (fileDetails != string.Empty)
                    ? string.Concat(fileDetails, "\n", firstLine)
                    : string.Concat(firstLine);

                if (lineCount == 90000)
                {
                    fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
                    string[] fileInfo = fileDetails.Split('\n');
                    // call respective method for parsing and insertion
                    fileDetails = string.Empty; // reset the buffer for the next batch
                    lineCount = 0;
                }
            }

            // when the file has e.g. 90,057 lines, parse the remaining 57
            if (lineCount > 0)
            {
                string[] fileInfo = fileDetails.Split('\n');
                // call respective method for parsing and insertion
                lineCount = 0;
            }
        }
    }

      

Here 90,000 is the batch size that is safe to handle without an out-of-memory exception in my case.

However, it takes more than two days to complete. I noticed this was because of reading line by line.

Is there a better approach for this?

Thanks in Advance :)

1 answer


You can use a profiler to determine what is hurting your performance. In this case it's obvious: disk access and string concatenation.

  1. Don't read the file more than once. Let's take a look at your code. First of all, the line

         int fileLineCount = File.ReadLines(filename).Count();

     means that you read the entire file just to count its lines and then throw away what you read. This is bad. Discard your if (fileLineCount < 90000) branch and keep only the else branch.

     It makes almost no difference whether you read line by line in sequential order or the whole file at once, because the reads are buffered anyway.



  2. Avoid string concatenation, especially for long strings.

         fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
         string[] fileInfo = fileDetails.Split('\n');

     This is really bad. You are already reading the file line by line, so why do this replacement and splitting at all? File.ReadLines() gives you an enumeration of all the lines. Just pass it to your parsing routine, as sketched below.
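
For illustration, here is a minimal sketch of that approach. ParseAndInsert is a hypothetical placeholder for your parsing and database-insertion routine, and the 90,000 batch size is simply the figure from the question:

    using System.Collections.Generic;
    using System.IO;

    class FileParser
    {
        const int BatchSize = 90000; // same batch size as in the question

        static void ParseFile(string filename)
        {
            var batch = new List<string>(BatchSize);

            // File.ReadLines enumerates lazily, so only one batch of lines
            // is held in memory at a time - no giant concatenated string.
            foreach (string line in File.ReadLines(filename))
            {
                batch.Add(line);
                if (batch.Count == BatchSize)
                {
                    ParseAndInsert(batch);   // hypothetical parse/DB-insert routine
                    batch.Clear();
                }
            }

            if (batch.Count > 0)
                ParseAndInsert(batch);       // leftover lines (e.g. the last 57)
        }

        static void ParseAndInsert(IReadOnlyList<string> lines)
        {
            // placeholder: parse each line and bulk-insert into the table
        }
    }

The point is that the file is enumerated exactly once and nothing beyond the current batch is ever kept in memory.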

If you do this correctly, I expect a significant speedup. It can be optimized further by reading the files on a separate thread and processing them on the main thread, but that's another story.
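
If you want to try that producer/consumer idea, a rough sketch is below, assuming the same hypothetical ParseAndInsert routine as above; the bounded BlockingCollection keeps the reader thread from running far ahead of the parser, and the folder enumeration and batch size are illustrative only:

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    class PipelineSketch
    {
        static void Run(string rootFolder)
        {
            // Bounded queue: at most 4 pending batches, which caps memory use.
            using (var queue = new BlockingCollection<string[]>(boundedCapacity: 4))
            {
                // Producer: reads the files and enqueues batches of lines.
                var reader = Task.Run(() =>
                {
                    try
                    {
                        foreach (string file in Directory.EnumerateFiles(
                                     rootFolder, "*", SearchOption.AllDirectories))
                        {
                            foreach (string[] batch in ReadInBatches(file, 90000))
                                queue.Add(batch);
                        }
                    }
                    finally
                    {
                        queue.CompleteAdding(); // unblock the consumer even on error
                    }
                });

                // Consumer (main thread): parses and inserts each batch.
                foreach (string[] batch in queue.GetConsumingEnumerable())
                    ParseAndInsert(batch);

                reader.Wait();
            }
        }

        static IEnumerable<string[]> ReadInBatches(string file, int batchSize)
        {
            var batch = new List<string>(batchSize);
            foreach (string line in File.ReadLines(file))
            {
                batch.Add(line);
                if (batch.Count == batchSize)
                {
                    yield return batch.ToArray();
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                yield return batch.ToArray();
        }

        static void ParseAndInsert(string[] lines)
        {
            // placeholder: parse each line and bulk-insert into the table
        }
    }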
