Large File Processing - Reading Algorithm Breaks - C#

I have an algorithm that reads from a very large (~155+ MB) binary file, parses it according to the spec, and writes out the information it needs as CSV (flat text). It works flawlessly for the first 15.5 million lines of output, which produces a CSV file of ~0.99-1.03 GB. That is only about 20% of the way through the binary file. After that point it breaks: suddenly the printed data does not match what is in the binary file at all. I checked the binary file, and the same pattern continues (data split into "packets"; see the code below). Because of the way the data is handled, memory usage never really increases (steady at ~15K). The functional code is below. Is it my algorithm (and if so, why would it break only after 15.5 million lines?!), or are there other implications of the large file sizes that I am not considering? Any ideas?

(FYI: each "packet" is 77 bytes long, starting with a 3-byte "start code" and ending with a 5-byte "end code"; you will see the pattern in the code below.)

Edit: the code has been updated based on the suggestions below ... thanks!

private void readBin(string theFile)
{
    List<int> il = new List<int>();
    bool readyForProcessing = false;

    byte[] packet = new byte[77];

    try
    {
        FileStream fs_bin = new FileStream(theFile, FileMode.Open);
        BinaryReader br = new BinaryReader(fs_bin);

        while (br.BaseStream.Position < br.BaseStream.Length && working)
        {
            // Find the first startcode
            while (!readyForProcessing)
            {
                // If last byte of endcode adjacent to first byte of startcod...
                // This never occurs outside of ending/starting, so it is safe
                if (br.ReadByte() == 0x0a && br.PeekChar() == (char)0x16)
                    readyForProcessing = true;
            }

            // Read a full packet of 77 bytes
            br.Read(packet, 0, packet.Length);

            // Unnecessary I guess now, but ensures packet begins
            // with startcode and ends with endcode
            if (packet.Take(3).SequenceEqual(STARTCODE) &&
                packet.Skip(packet.Length - ENDCODE.Length).SequenceEqual(ENDCODE))
            {
                il.Add(BitConverter.ToUInt16(packet, 3)); //il.ElementAt(0) == 2byte id
                il.Add(BitConverter.ToUInt16(packet, 5)); //il.ElementAt(1) == 2byte semistable
                il.Add(packet[7]); //il.ElementAt(2) == 1byte constant

                for(int i = 8; i < 72; i += 2) //start at 8th byte, get 64 bytes
                    il.Add(BitConverter.ToUInt16(packet, i));

                for (int i = 3; i < 35; i++)
                {
                    sw.WriteLine(il.ElementAt(0) + "," + il.ElementAt(1) +
                        "," + il.ElementAt(2) + "," + il.ElementAt(i));
                }

                il.Clear();
            }
            else
            {
                // Handle "bad" packets
            }
        } // while

        fs_bin.Flush();
        br.Close();                
        fs_bin.Close();
    }
    catch (Exception e)
    {
        MessageBox.Show(e.ToString());
    }
}
      

+2


2 answers


Your code silently catches any exception that occurs in the while loop and swallows it.

This is bad practice because it masks problems like the one you are working on.

Most likely, one of the methods you call inside the loop (for example, int.Parse()) is throwing an exception because it runs into a problem with the data format (or with your assumptions about that format).

Once an exception is thrown, the loop that reads the data is thrown off, because it is no longer positioned on a record boundary.

There are several things you need to do to make this code more robust:

  • Don't swallow exceptions in the processing loop. Deal with them.
  • Do not read the data byte by byte or field by field in a loop. Since your records are of fixed size (77 bytes), read an entire record into a byte[] and then process it from there. This helps ensure you are always reading on a record boundary.
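
The second point can be sketched as follows. This is only an illustration under assumptions: the class and method names (PacketReader, TryReadPacket, CountPackets) are hypothetical and not from the question. It relies on the documented behaviour that Stream.Read may return fewer bytes than requested, so the read is repeated until the 77-byte buffer is full.

```csharp
using System;
using System.IO;

static class PacketReader
{
    public const int PacketSize = 77; // record length stated in the question

    // Fills 'buffer' completely, or returns false if the stream ends first.
    // Stream.Read may return fewer bytes than requested, so loop until done.
    public static bool TryReadPacket(Stream input, byte[] buffer)
    {
        int filled = 0;
        while (filled < buffer.Length)
        {
            int n = input.Read(buffer, filled, buffer.Length - filled);
            if (n == 0)
                return false; // end of stream: no data, or a truncated final record
            filled += n;
        }
        return true;
    }

    // Counts the complete records in a stream. Exceptions are deliberately
    // not caught here, so a real I/O error surfaces to the caller.
    public static int CountPackets(Stream input)
    {
        var packet = new byte[PacketSize];
        int count = 0;
        while (TryReadPacket(input, packet))
            count++;
        return count;
    }
}
```

In the real loop, the body of CountPackets would parse each packet instead of just counting it; the point is that every iteration starts exactly one record after the previous one.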
+17


  • Don't put an empty generic catch block there that silently catches and continues. You should check whether you are actually getting an exception there too.
  • There is no need for the byteToHexString function. Just put the 0x prefix before the hexadecimal number and it will do a binary comparison.

i.e.

if(al[0] == 0x16 && al[1] == 0x3C && al[2] == 0x02)
{
    ...
}

      

  • I don't know what your doConvert function does (you did not provide its source), but the BinaryReader class provides many different read methods, one of which is ReadInt16. If your short values are not stored in an encoded format, this should be easier to use than a rather convoluted conversion. Even if they are encoded, it is still much simpler to read the bytes in and manipulate them than to make several round trips through string conversions.
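
As a rough sketch of the BinaryReader approach, the following reads the fields of one packet directly. The layout is assumed from the question (3-byte start code, 2-byte id, 2-byte "semistable" value, 1-byte constant, 32 two-byte values, 5-byte end code), the values are assumed to be unsigned and little-endian, and the PacketFields class is a hypothetical name used only for illustration.

```csharp
using System;
using System.IO;

static class PacketFields
{
    // Reads the header fields and the 32 values from one 77-byte packet.
    // The end code is not checked here; validate it separately.
    public static ushort[] ReadValues(byte[] packet, out ushort id,
                                      out ushort semistable, out byte constant)
    {
        using (var reader = new BinaryReader(new MemoryStream(packet)))
        {
            reader.ReadBytes(3);              // skip the start code
            id = reader.ReadUInt16();         // BinaryReader reads little-endian
            semistable = reader.ReadUInt16();
            constant = reader.ReadByte();

            var values = new ushort[32];
            for (int i = 0; i < values.Length; i++)
                values[i] = reader.ReadUInt16();
            return values;
        }
    }
}
```

ReadUInt16 is used rather than ReadInt16 only because the question's updated code converts with BitConverter.ToUInt16; use whichever matches the file's specification.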

Edit

You seem to be very liberal in your use of LINQ extension methods (especially ElementAt). On a general sequence, each call enumerates from the start until it reaches that index; on a List<T> it ends up using the list's indexer anyway, after an extra call and type check. Your code will be more efficient (as well as less verbose) if you just use the built-in indexer on the list,

i.e. al[3] rather than al.ElementAt(3).
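
For illustration, here is the question's output line written with the indexer. The IndexerExample class and FormatLine method are hypothetical names; the list layout (three header fields followed by the values) is taken from the question's code.

```csharp
using System.Collections.Generic;

static class IndexerExample
{
    // Formats one CSV line from the three header fields plus one value.
    public static string FormatLine(List<int> il, int valueIndex)
    {
        // il[valueIndex] returns the same element as il.ElementAt(valueIndex),
        // without going through the LINQ extension method.
        return il[0] + "," + il[1] + "," + il[2] + "," + il[valueIndex];
    }
}
```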

Also, you don't need to call Flush on an input Stream. Flush tells a stream to write whatever is in its write buffer to the underlying OS file handle. For an input stream there is nothing to flush.
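
By contrast, Flush does matter on the output side. The sketch below is an assumption-laden illustration, not part of the original code: FlushExample and WriteAndFlush are hypothetical names, and the stream passed in is assumed to support Length (for example a MemoryStream or FileStream). It records how many bytes have reached the underlying stream before and after the writer is flushed.

```csharp
using System.IO;

static class FlushExample
{
    // Returns the byte counts seen by the underlying stream before and
    // after Flush, to show that a StreamWriter buffers its output.
    public static long[] WriteAndFlush(Stream output)
    {
        // The writer is intentionally not disposed here, because disposing
        // it would also close 'output', which belongs to the caller.
        var writer = new StreamWriter(output);
        writer.WriteLine("1,2,3,4");
        long before = output.Length;   // the text is still in the writer's buffer
        writer.Flush();                // pushes the buffered text to the stream
        long after = output.Length;
        return new[] { before, after };
    }
}
```

If the CSV writer (sw in the question) is never flushed or disposed, the tail end of the output can be lost, so that is the stream worth checking.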

I would suggest replacing the current sw.WriteLine call with the following:

sw.WriteLine(BitConverter.ToString(packet));

and see whether the data on the line where the output starts to go wrong is what you expect.
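
A slightly extended version of that diagnostic is sketched here. The HexDump class and its offset parameter are hypothetical additions, included so that each dumped packet can be located in a hex editor; BitConverter.ToString itself produces dash-separated hex pairs.

```csharp
using System;

static class HexDump
{
    // Formats a packet as "<offset in hex>: <bytes as dash-separated hex>".
    // 'offset' is the position in the file where the packet started.
    public static string Describe(long offset, byte[] packet)
    {
        return string.Format("{0:X8}: {1}", offset, BitConverter.ToString(packet));
    }
}
```

For example, HexDump.Describe(0x4D, new byte[] { 0x16, 0x3C, 0x02 }) returns "0000004D: 16-3C-02". In the question's loop the offset could be taken from br.BaseStream.Position just before the packet is read.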

I would do this:

if (packet.Take(3).SequenceEqual(STARTCODE) &&
    packet.Skip(packet.Length - ENDCODE.Length).SequenceEqual(ENDCODE))
{
    ushort id = BitConverter.ToUInt16(packet, 3);
    ushort semistable = BitConverter.ToUInt16(packet, 5);
    byte constant = packet[7];

    // Bytes 8..71 hold the 32 two-byte values
    for (int i = 8; i < 72; i += 2)
    {
        il.Add(BitConverter.ToUInt16(packet, i));
    }

    // il now holds only the values, so every element is written out
    foreach (ushort element in il)
    {
        sw.WriteLine(string.Format("{0},{1},{2},{3}", id, semistable, constant, element));
    }

    il.Clear();
}
else
{
    //handle "bad" packets
}

      

+3

