Best storage format in terms of file size and performance (e.g. text, ASCII, binary, etc.)?

Can anyone help me find the best storage format, in terms of read/write speed and file size, for storing large matrices (fixed-precision floats) in a file on the hard disk?

I have been using ASCII, text and binary formats. For the same matrix size (e.g. 10000x10000x200) and numeric precision (e.g. 5 significant digits), I found that binary gave the best results overall, followed by ASCII and text, in terms of read/write speed and file size (I have not done any rigorous benchmarking).

With that said, is there a better standard data storage format than binary in my situation? If not, is there any way to optimize the data structure to improve save/read performance?

P.S. I can use C, C++ or Matlab (it doesn't matter which to me) if that helps get better results.

+3




3 answers


In general, binary will be much faster. If you are using single-precision floats, you store 4 bytes per number rather than 1 byte per character of the number, so 5.34182 takes 4 bytes instead of 7 bytes plus a separator.
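As a rough illustration, here is a minimal C++ sketch (file names and matrix contents are placeholders, not the poster's setup) writing the same values as text and as raw binary:

```cpp
#include <fstream>
#include <vector>

int main() {
    // Placeholder data: in practice this would be the matrix to store.
    std::vector<float> m(1000 * 1000, 5.34182f);

    // Text: roughly 7+ characters per value plus a separator.
    {
        std::ofstream txt("matrix.txt");
        for (float v : m) txt << v << '\n';
    }

    // Binary: exactly sizeof(float) == 4 bytes per value.
    {
        std::ofstream bin("matrix.bin", std::ios::binary);
        bin.write(reinterpret_cast<const char*>(m.data()),
                  static_cast<std::streamsize>(m.size() * sizeof(float)));
    }
    return 0;
}
```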

Going further, you can probably do better. Your disk does not read data byte by byte; it reads data in blocks, and usually you want to avoid reading more blocks than you have to. The real reason the binary format is faster is not that it takes up fewer bytes, but that it takes up fewer blocks (as a consequence of taking up fewer bytes). This means you want to minimize the size on disk, because reading from disk is orders of magnitude slower than reading from RAM - disk accesses are measured in milliseconds, while RAM accesses are measured in microseconds.



So what can you do? If your matrix is sparse, you can store only the elements that are non-zero, which can save you a lot of space. Instead of storing every element, store an (index, value) pair for each non-zero entry. Each entry is now 8 bytes instead of 4, but if more than half of the matrix is zero, you save a lot of space.
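A minimal sketch of that (index, value) layout, assuming a flattened dense matrix and a 32-bit index (matrices with more than 2^32 elements would need a 64-bit index):

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// Write only the non-zero entries of a flattened dense matrix as
// (index, value) pairs: 4-byte index + 4-byte float = 8 bytes per entry.
void write_sparse(const std::vector<float>& m, const char* path) {
    std::ofstream out(path, std::ios::binary);
    for (std::size_t n = 0; n < m.size(); ++n) {
        if (m[n] != 0.0f) {
            std::uint32_t idx = static_cast<std::uint32_t>(n);
            out.write(reinterpret_cast<const char*>(&idx), sizeof(idx));
            out.write(reinterpret_cast<const char*>(&m[n]), sizeof(float));
        }
    }
}
```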

Finally, compression can help here. Of course, more compression means more CPU time to decompress the matrix, but it can also mean faster disk reads. This is where you really need to experiment - at the simple end of the spectrum, run-length encoding is easy to implement and often works surprisingly well. It works because if you are storing small integers and "simple" floats, most of the bytes are zero. It also works well when the same number is repeated many times, which happens in matrices. I would also recommend looking at more advanced schemes like bzip2, which, while more computationally expensive, can significantly reduce the size on disk. Alas, compression tends to be very domain specific, so you have to experiment here. What works in one domain doesn't always work in another.
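For example, a byte-level run-length encoder is only a few lines of C++ (this is an illustrative sketch of the general technique, not production code):

```cpp
#include <cstdint>
#include <vector>

// Encode runs of repeated bytes (e.g. long runs of zeros in raw binary
// matrix data) as (run length, value) pairs.
std::vector<std::uint8_t> rle_encode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size();) {
        std::uint8_t value = in[i];
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == value && run < 255) ++run;
        out.push_back(static_cast<std::uint8_t>(run));  // run length (1..255)
        out.push_back(value);                           // repeated byte
        i += run;
    }
    return out;
}
```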

+3




This is a complex trade-off, and a lot of people have been there, weighing library efficiency against ease of use and sharing - have you considered something like HDF5 or NetCDF? Both have C/C++ libraries as well as bindings to common tools like Matlab, Python, R, ...
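For instance, writing a 2-D float matrix with the HDF5 C API looks roughly like this (dataset and file names here are arbitrary examples, and error checking is omitted):

```cpp
#include <hdf5.h>
#include <vector>

int main() {
    const hsize_t dims[2] = {3000, 3000};
    std::vector<float> m(dims[0] * dims[1], 0.0f);  // placeholder data

    hid_t file  = H5Fcreate("matrix.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, nullptr);
    hid_t dset  = H5Dcreate2(file, "/matrix", H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // Write the whole matrix in one call; HDF5 stores it self-described,
    // so it can be read back from C/C++, Matlab, Python, R, ...
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, m.data());

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```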



That said, I have also written one-off binary writers in the past.

+1




Yes, a 64 MB disk cache won't help in my case.

Unfortunately I am working with very dense matrices (finite elements with strong coupling), and limited-precision binary seems to give the best performance in terms of read/write speed and file size (much smaller), even without any compression.

The text format resulted in significantly larger files than binary; after compression the resulting file size is about the same as binary, but the compression takes significant time, and the read/write times are also very long.

For 3000x3000 (single precision), the binary read/write times (68 MB file) were 0.05 s / 0.23 s, versus 13.8 s / 6.5 s for text (145 MB). For 6000x6000 (single precision), the binary read/write times (274 MB) were 0.22 s / 0.92 s, versus 56 s / 26 s for text (583 MB). However, these values may not be accurate, as the hard drive is probably an important limiting factor for me.

Tests were conducted with the same precision (in different combinations), the same matrix sizes (3000x3000, 6000x6000, 12000x12000) and the same processor affinity, using the standard Matlab fwrite, fprintf, fread and fscanf. I could not go to higher sizes/precision because the hard disk read/write speed was the limit and the CPU was already maxed out.
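For comparison, a rough C++ analogue of this kind of binary round-trip timing (sizes, file name and fill values are illustrative only; this is not the Matlab test described above, and error handling is omitted):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 3000ull * 3000ull;      // 3000x3000 single-precision matrix
    std::vector<float> m(n, 1.0f), back(n);

    auto t0 = std::chrono::steady_clock::now();
    std::FILE* out = std::fopen("matrix.bin", "wb");
    std::fwrite(m.data(), sizeof(float), n, out);  // binary write
    std::fclose(out);
    auto t1 = std::chrono::steady_clock::now();

    std::FILE* in = std::fopen("matrix.bin", "rb");
    std::fread(back.data(), sizeof(float), n, in); // binary read
    std::fclose(in);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("write: %.3f s, read: %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count());
    return 0;
}
```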

0








