Choosing a magic byte least likely to appear in real data

Hopefully this isn't too subjective for SO; it may not have a good answer.

In part of a library I am writing, I have a byte array that is populated with values provided by the user. These values can be floats, doubles, ints (of various sizes), etc., with the binary representations you would expect from C. That's all I know about the values.

I have an opportunity for an optimization: I can initialize my byte array with a byte `MAGIC`, and then whenever none of the bytes of a user-supplied value equal `MAGIC`, I can take a fast path; otherwise I have to take a slow path.
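For concreteness, here is a minimal sketch of the scheme in C (the value `0xDB`, `contains_magic`, and `store_value` are illustrative placeholders, not my actual code; choosing the value well is exactly the question):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAGIC 0xDB  /* placeholder; picking this value is the question */

/* Returns true if any byte of the value's binary representation equals MAGIC. */
static bool contains_magic(const void *value, size_t size)
{
    return memchr(value, MAGIC, size) != NULL;
}

void store_value(unsigned char *buf, const void *value, size_t size)
{
    if (!contains_magic(value, size)) {
        /* fast path: remaining MAGIC bytes in buf still mark untouched slots */
        memcpy(buf, value, size);
    } else {
        /* slow path: fall back to whatever bookkeeping disambiguates MAGIC */
    }
}
```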

So my question is: what is a principled way to select my magic byte so that it is least likely to appear in the (variably encoded and distributed) data I receive?

Part of my question, I suppose, is whether there is something like Benford's law that can tell me about the distribution of byte values across many kinds of data.


1 answer


Capture real-world data from a diverse set of the inputs that applications using your library will supply.

Write a quick and dirty program to analyze the dataset. It sounds like you want to know which byte values are most often entirely absent from an input, so the program's output should indicate, for each byte value, how many inputs do not contain it at all.
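A rough sketch of such a program in C (assuming the dataset is simply a set of files passed on the command line; all names here are illustrative):

```c
#include <stdio.h>

int main(int argc, char **argv)
{
    /* absent_count[b] = number of input files containing no byte equal to b */
    long absent_count[256] = {0};

    for (int i = 1; i < argc; i++) {
        FILE *f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); continue; }

        int seen[256] = {0};
        int c;
        while ((c = fgetc(f)) != EOF)
            seen[c] = 1;
        fclose(f);

        for (int b = 0; b < 256; b++)
            if (!seen[b])
                absent_count[b]++;
    }

    for (int b = 0; b < 256; b++)
        printf("0x%02X absent from %ld of %d inputs\n",
               b, absent_count[b], argc - 1);
    return 0;
}
```

Sorting the output by `absent_count` then shows which byte values are the strongest magic-byte candidates for this dataset.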



Note that this is not the same as the least frequent byte value. When analyzing data, you have to be careful that you know exactly what you are measuring!

Use the analysis to inform your design. If no byte value is reliably absent from the inputs, you may want to skip the optimization entirely.
