Parsing variable length descriptors from a byte stream and acting on their type

I am reading from a byte stream that contains a series of variable-length descriptors, which I represent as various structures / classes in my code. Each descriptor has a fixed-length header, common to all descriptors, that identifies its type.

Is there a suitable model or pattern that I can use to better parse and represent each descriptor and then take the appropriate action based on its type?

+2




6 answers


I've written a lot of these parsers.



I recommend that you read the fixed-length header and then dispatch to the correct constructor for your structures using a simple switch-case, passing the fixed header and the stream to that constructor so that it can consume the variable portion of the stream.
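A minimal Python sketch of that dispatch, under assumptions that are not in the question: the header is taken to be a 1-byte type followed by a 2-byte big-endian payload length, and `NameDescriptor`, `SizeDescriptor` and `read_exact` are made-up names for illustration.

```python
import io
import struct

# Assumed header layout: 1-byte type, 2-byte big-endian payload length.
HEADER = struct.Struct(">BH")


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError (hypothetical helper)."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


class NameDescriptor:
    def __init__(self, length, stream):
        # The constructor consumes the variable-length part itself.
        self.name = read_exact(stream, length).decode("utf-8")


class SizeDescriptor:
    def __init__(self, length, stream):
        (self.size,) = struct.unpack(">I", read_exact(stream, length))


def read_descriptor(stream):
    """Read the fixed header, then hand the stream to the matching constructor."""
    dtype, length = HEADER.unpack(read_exact(stream, HEADER.size))
    if dtype == 1:
        return NameDescriptor(length, stream)
    elif dtype == 2:
        return SizeDescriptor(length, stream)
    raise ValueError(f"unknown descriptor type {dtype}")


stream = io.BytesIO(b"\x01\x00\x03abc" + b"\x02\x00\x04\x00\x00\x01\x00")
first = read_descriptor(stream)
second = read_descriptor(stream)
```

On a socket or pipe a single `read` may return fewer bytes than requested, so a real `read_exact` would probably need to loop.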

+9




This is a common problem when parsing files. Typically, you read the known part of the descriptor (which, fortunately, is of a fixed length in this case, but not always) and branch on it. I usually use the strategy pattern, since I generally expect the system to need to be flexible, but a direct switch or a factory might work as well.

Another question is: do you control and trust the downstream code, meaning the factory / strategy implementation? If you do, you can just give it the stream and the number of bytes you expect it to consume (perhaps adding some debug assertions to verify that it reads exactly the correct amount).

If you cannot trust the factory / strategy implementation (perhaps you are allowing user code to plug in custom deserializers), then I would build a wrapper on top of the stream (example: SubStream from protobuf-net) that only allows the expected number of bytes to be consumed (after that it reports EOF) and prevents seek / etc. operations outside of this block. I would also check at runtime (even in release builds) that enough data was consumed, but in this case I would probably just read past any unread data: if we expected the downstream code to consume 20 bytes but it only read 12, then skip the next 8 and read our next descriptor.
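A rough Python sketch of such a bounded wrapper. It borrows the `SubStream` name from the text but is not protobuf-net's implementation; `skip_rest` is a hypothetical helper, and seeking is prevented here simply by not exposing a seek method.

```python
import io


class SubStream:
    """Read-only view exposing at most `limit` bytes of `source`."""

    def __init__(self, source, limit):
        self._source = source
        self._remaining = limit

    def read(self, n=-1):
        # Clamp every request to what is left of this block.
        if n is None or n < 0 or n > self._remaining:
            n = self._remaining
        data = self._source.read(n)
        self._remaining -= len(data)
        return data  # b"" once the block is exhausted, i.e. EOF

    def skip_rest(self):
        """Discard whatever the consumer left unread; return the byte count."""
        skipped = 0
        while self._remaining > 0:
            chunk = self.read(min(self._remaining, 4096))
            if not chunk:  # underlying stream ended early
                break
            skipped += len(chunk)
        return skipped


# A 20-byte block followed by the start of the next descriptor.
source = io.BytesIO(b"A" * 12 + b"B" * 8 + b"NEXT")
block = SubStream(source, 20)
payload = block.read(12)      # downstream code reads only 12 of the 20 bytes
skipped = block.skip_rest()   # the parser skips the 8 bytes left over
at_end = block.read(10)       # further reads report EOF
following = source.read(4)    # the outer stream sits at the next descriptor
```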

To expand on this: one strategy design might have something like:



interface ISerializer {
    object Deserialize(Stream source, int bytes);
    void Serialize(Stream destination, object value);
}

      

You can build a dictionary (or just a list if the number is small) of such serializers keyed by the expected markers, look up the serializer for the marker you read, and then call its Deserialize method. If you don't recognize the marker, then (one of):

  • skip the specified number of bytes
  • throw an error
  • store the extra bytes in a buffer somewhere (allowing unexpected data to be round-tripped)
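The lookup and the three fallbacks could be sketched in Python roughly as follows. The marker values, the header layout (1-byte marker, 2-byte big-endian length) and the names `DESERIALIZERS` and `read_all` are assumptions for illustration; for brevity the payload is read up front and passed as bytes instead of handing over the stream and a byte count.

```python
import io
import struct

# Assumed header layout: 1-byte marker, 2-byte big-endian payload length.
HEADER = struct.Struct(">BH")


def parse_text(payload):
    return payload.decode("utf-8")


def parse_u32(payload):
    return struct.unpack(">I", payload)[0]


# Registry of deserializers keyed by marker (hypothetical marker values).
DESERIALIZERS = {1: parse_text, 2: parse_u32}


def read_all(stream, on_unknown="skip"):
    """Yield (marker, value); unknown markers are skipped, kept raw, or rejected."""
    while True:
        head = stream.read(HEADER.size)
        if not head:
            return
        if len(head) < HEADER.size:
            raise EOFError("truncated header")
        marker, length = HEADER.unpack(head)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated payload")
        handler = DESERIALIZERS.get(marker)
        if handler is not None:
            yield marker, handler(payload)
        elif on_unknown == "keep":
            yield marker, payload  # raw bytes, so they can be written back later
        elif on_unknown == "error":
            raise ValueError(f"unknown marker {marker}")
        # "skip": the payload has already been consumed, nothing else to do


data = b"\x01\x00\x02hi" + b"\x09\x00\x03xyz" + b"\x02\x00\x04\x00\x00\x00\x07"
skipped = list(read_all(io.BytesIO(data)))
kept = list(read_all(io.BytesIO(data), on_unknown="keep"))
```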

As a side note to the above: this approach (strategy) is useful if the system is defined at runtime, either through reflection or via a runtime DSL (etc.). If the system is completely predictable at compile time (because it doesn't change, or because you are using code generation), then a direct switch approach may be more appropriate, and you probably don't need any additional interfaces, since you can put the appropriate code in directly.

+2




One key point to remember: if you are reading from a stream and do not find a valid header / message, discard only the first byte before trying again. Many times I have seen a whole packet or message dropped instead, which can result in the loss of valid data.
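To illustrate, here is a small Python sketch using a made-up frame format (a `0x7E` sync byte, a 1-byte payload length, the payload, then a 1-byte checksum equal to the payload sum modulo 256). None of this comes from the question; it only shows why the scanner should advance by a single byte after a failed match.

```python
SYNC = 0x7E  # assumed start-of-frame marker


def extract_frames(buf):
    """Return (frames, leftover). On a bad frame, drop one byte and rescan."""
    frames = []
    pos = 0
    while len(buf) - pos >= 3:  # sync + length + checksum at minimum
        if buf[pos] != SYNC:
            pos += 1
            continue
        length = buf[pos + 1]
        end = pos + 2 + length + 1
        if end > len(buf):
            break  # frame may be incomplete; wait for more data
        payload = buf[pos + 2:end - 1]
        if sum(payload) % 256 == buf[end - 1]:
            frames.append(bytes(payload))
            pos = end
        else:
            # Not `pos = end`: a real frame may start inside the bogus one.
            pos += 1
    return frames, bytes(buf[pos:])


# A false sync byte claims 5 payload bytes; a genuine frame (payload 01 02)
# begins two bytes later, inside that claimed payload.
buf = bytes([0x7E, 0x05, 0x7E, 0x02, 0x01, 0x02, 0x03, 0xFF])
frames, leftover = extract_frames(buf)
```

Had the scanner jumped past the whole bogus frame, it would have discarded all eight bytes and lost the valid frame.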

+2




This sounds like it might be a job for Factory Method or perhaps Abstract Factory. Based on the header, you choose a factory method to call, and it returns an object of the appropriate type.

Whether this is better than simply calling constructors from a switch statement depends on the complexity and uniformity of the objects being created.

+1




I would suggest:

fifo = Fifo.new

while (fd is readable) {
  read everything off the fd and stick it into fifo
  if (the front of the fifo has a valid header and 
      the fifo is big enough for payload) {

      dispatch constructor, remove bytes from fifo
  }
}

With this method:

  • you can do some error checking on bad payloads and potentially throw bad data away.
  • no data is waiting in the fd read buffer (can be a problem for large payloads).
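A runnable Python take on the loop above, as a sketch only. It assumes a 1-byte type plus 2-byte big-endian length header, uses a `bytearray` as the FIFO, and `pump` and `dispatch` are made-up names. Unlike the pseudocode it drains every complete descriptor after each read, and it leaves out the "valid header" check.

```python
import io
import struct

# Assumed header layout: 1-byte type, 2-byte big-endian payload length.
HEADER = struct.Struct(">BH")


def pump(fd, dispatch, chunk_size=4096):
    """Drain fd into a FIFO buffer and dispatch every complete descriptor."""
    fifo = bytearray()
    while True:
        chunk = fd.read(chunk_size)
        if not chunk:
            break
        fifo += chunk
        # Dispatch as many complete descriptors as the buffer holds.
        while len(fifo) >= HEADER.size:
            dtype, length = HEADER.unpack_from(fifo)
            total = HEADER.size + length
            if len(fifo) < total:
                break  # payload not fully buffered yet
            dispatch(dtype, bytes(fifo[HEADER.size:total]))
            del fifo[:total]
    return bytes(fifo)  # incomplete trailing bytes, if any


# Two complete descriptors followed by a truncated third one.
data = b"\x01\x00\x03abc" + b"\x02\x00\x01Z" + b"\x03\x00\x05ab"
seen = []
# A tiny chunk size simulates descriptors arriving split across reads.
leftover = pump(io.BytesIO(data), lambda t, p: seen.append((t, p)), chunk_size=4)
```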
0




If you want this to be nicely OO, you can use the visitor pattern in an object hierarchy. This is how I did it (to identify packets captured from the network, which is pretty much what you need):

  • huge hierarchy of objects, with one parent class

  • each class has a static constructor that registers it with its parent, so the parent knows about its direct children (this was C++; I think this step is unnecessary in languages with good reflection support)

  • each class had a static constructor method that received the remainder of the byte stream and, based on it, decided whether or not it was responsible for handling that data

  • when a packet arrived, I just passed it to the static constructor method of the root parent class (i.e. Packet), which in turn asked each of its children whether it was responsible for that packet; this continued recursively until a class at the bottom of the hierarchy returned an instance of itself

  • each of the static "constructor" methods strips its own header from the byte stream and passes only the payload on to its children

The upside of this approach is that you can add new types anywhere in the object hierarchy WITHOUT having to see / modify any other class. It worked remarkably well for packets; the hierarchy looked like this:

  • Packet
  • EthernetPacket
  • IPPacket
  • UDPPacket, TCPPacket, ICMPPacket
  • ...

Hope you get the idea.
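For what it's worth, the self-registering hierarchy could be sketched in Python like this, with the `__init_subclass__` hook standing in for the C++ static registration. The one-byte tags are toy values invented for the example, not the real Ethernet / IP / UDP / TCP header formats, and `accepts` and `parse` are hypothetical method names.

```python
class Packet:
    HEADER_SIZE = 0   # the root class has no header of its own
    TAG = None
    children = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.children = []                  # each class tracks only its direct children
        cls.__base__.children.append(cls)  # register with the direct parent

    def __init__(self, payload):
        self.payload = payload

    @classmethod
    def accepts(cls, data):
        """Decide whether this class is responsible for `data`."""
        return len(data) >= cls.HEADER_SIZE and data[0] == cls.TAG

    @classmethod
    def parse(cls, data):
        """Strip this class's header, then let the children claim the payload."""
        body = data[cls.HEADER_SIZE:]
        for child in cls.children:
            if child.accepts(body):
                return child.parse(body)
        return cls(body)  # no child claimed it, so this class is the best match


class EthernetPacket(Packet):
    HEADER_SIZE = 1
    TAG = 0xE0  # toy tag


class IPPacket(EthernetPacket):
    TAG = 0x04  # toy tag


class UDPPacket(IPPacket):
    TAG = 0x11  # toy tag


class TCPPacket(IPPacket):
    TAG = 0x06  # toy tag


udp = Packet.parse(b"\xE0\x04\x11hello")
unknown_transport = Packet.parse(b"\xE0\x04\x99zz")
```

Adding, say, an `ICMPPacket(IPPacket)` subclass with its own tag would need no change to any existing class.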

0

