Best way to load a large text file in Java

I have a text file with a sequence of integers on each line:

47202 1457 51821 59788 
49330 98706 36031 16399 1465
...

      

There are 3 million lines in this format in the file. I need to load the file into memory, extract 5-grams from it, and do some statistics. I have a memory limit of 8 GB of RAM. I have tried to keep the number of objects I create to a minimum (there is only one class, with 6 float fields and some methods). Each line of the file generates a number of objects of this class proportional to the number of words on that line. I am starting to feel that Java is not a good fit for this when C++ is around.

Edit: Suppose each line produces (n - 1) objects of this class, where n is the number of space-separated tokens on that line (e.g. 1457). With an average of 10 words per line, each line yields 9 objects on average, so there will be about 9 * 3 * 10^6 = 27 million objects. The memory needed is therefore roughly 27 * 10^6 * (8-byte object header + 6 * 4-byte floats) ≈ 0.86 GB, plus a Map<String, Object> and another Map<Integer, ArrayList<Object>>. I need to keep everything in memory because there will be some kind of mathematical optimization afterwards.





2 answers


Reading / parsing the file:

The best way to handle large files in any language is to avoid loading them into memory in the first place.

In Java, have a look at MappedByteBuffer. It allows you to map a file into the process's address space and access its contents without loading it all onto the heap.
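
A minimal sketch of the mapped approach (the file name numbers.txt and the byte-by-byte loop are placeholders, not from the question; a single mapping is limited to about 2 GB, so a larger file would need several regions):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Paths.get("numbers.txt"), StandardOpenOption.READ)) {
            // Map the whole file; the OS pages bytes in lazily, nothing is copied onto the heap.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            while (buffer.hasRemaining()) {
                byte b = buffer.get();
                // ... parse digits / whitespace here ...
            }
        }
    }
}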

You can also try reading the file line by line and discarding each line after processing it, again to avoid holding the entire file in memory at once.
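
A sketch of the line-by-line variant (same placeholder file name; the per-line work is left as a stub):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineStream {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("numbers.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Only the current line is live; it becomes garbage as soon as it has been processed.
                String[] tokens = line.split(" ");
                // ... update 5-gram statistics from 'tokens' here ...
            }
        }
    }
}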

Processing the resulting objects



There are several options for handling the objects you produce while parsing:

  • The same as with the file itself - if you can accomplish what you need without keeping everything in memory (while streaming the file), that is the best solution. You have not described the problem you are trying to solve, so I don't know whether this is possible.

  • Compression of some sort - switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays with only short-lived accessor objects created to read it (see the sketch after this list), or find some pattern in your data that lets you store it more compactly.

  • Caching / offloading - if your data still does not fit in memory, "page" some of it out to disk. This can be as simple as extending Guava to page to disk, or using a library like Ehcache or the like.
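
To illustrate the flyweight idea from the second bullet, here is a sketch (the class and the six-float layout are assumptions based on the question, not the asker's actual code): every record is six consecutive floats in one big array, and records are addressed by index rather than by object reference.

public class RecordStore {
    private static final int FIELDS = 6;       // the 6 floats mentioned in the question
    private final float[] data;                // all records live in this single array
    private int size;

    public RecordStore(int capacity) {
        this.data = new float[capacity * FIELDS];
    }

    // Appends one record and returns its index; no per-record object is allocated.
    public int add(float a, float b, float c, float d, float e, float f) {
        int base = size * FIELDS;
        data[base] = a; data[base + 1] = b; data[base + 2] = c;
        data[base + 3] = d; data[base + 4] = e; data[base + 5] = f;
        return size++;
    }

    // Reads field 'field' (0..5) of record 'index'.
    public float get(int index, int field) {
        return data[index * FIELDS + field];
    }
}

At 27 million records this is about 27e6 * 6 * 4 bytes ≈ 650 MB of flat float data, with no per-object headers.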

A note on Java collections and maps

For small objects, Java collections, and maps in particular, carry a large memory penalty (mainly because everything is wrapped as an object, and because of the internal Map.Entry instances). At the cost of a slightly less elegant API, you should probably look at GNU Trove if memory is an issue.
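
For example (this assumes the trove4j 3.x artifact is on the classpath; the class names below are from that version):

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

public class TroveExample {
    public static void main(String[] args) {
        // Keys are primitive ints: no Integer boxing and no per-entry Map.Entry objects.
        TIntObjectHashMap<TIntArrayList> byToken = new TIntObjectHashMap<>();

        int token = 1457;
        TIntArrayList positions = byToken.get(token);
        if (positions == null) {
            positions = new TIntArrayList();
            byToken.put(token, positions);
        }
        positions.add(42);   // e.g. the position where the token occurred

        System.out.println(byToken.get(1457));
    }
}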



Optimal would be to hold only the integers and the line ends.

One way to do this is to convert the file into two files:

  • one binary file of integers (4 bytes each)
  • one binary file with the indices at which each next line starts.

For this you can use a Scanner for reading, and a DataOutputStream wrapped in a BufferedOutputStream for writing.
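
A sketch of that conversion, with placeholder file names (numbers.txt, integers.bin, lineEnds.bin):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

public class Convert {
    public static void main(String[] args) throws IOException {
        try (Scanner in = new Scanner(Files.newBufferedReader(Paths.get("numbers.txt")));
             DataOutputStream ints = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("integers.bin")));
             DataOutputStream ends = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("lineEnds.bin")))) {
            int count = 0;
            while (in.hasNextLine()) {
                Scanner lineScanner = new Scanner(in.nextLine());
                while (lineScanner.hasNextInt()) {
                    ints.writeInt(lineScanner.nextInt());
                    count++;
                }
                ends.writeInt(count);   // index of the first integer of the next line
            }
        }
    }
}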



Then you can load these two files into primitive-type arrays:

int[] integers = new int[(int) (integersFile.length() / 4)];
int[] lineEnds = new int[(int) (lineEndsFile.length() / 4)];
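
Filling them back in can be done with a DataInputStream, for example (same placeholder file names as above):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class LoadArrays {
    public static void main(String[] args) throws IOException {
        File integersFile = new File("integers.bin");
        File lineEndsFile = new File("lineEnds.bin");

        int[] integers = new int[(int) (integersFile.length() / 4)];
        int[] lineEnds = new int[(int) (lineEndsFile.length() / 4)];

        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(integersFile)))) {
            for (int i = 0; i < integers.length; i++) integers[i] = in.readInt();
        }
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(lineEndsFile)))) {
            for (int i = 0; i < lineEnds.length; i++) lineEnds[i] = in.readInt();
        }
        // Roughly 27e6 ints ≈ 108 MB for 'integers', comfortably inside the 8 GB budget.
    }
}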

      

Reading can also be done with MappedByteBuffer.asIntBuffer(). (Then you don't even need the arrays, though the code becomes a bit COBOL-like in its verbosity.)
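
A sketch of that variant, reusing the integers.bin file from above (DataOutputStream writes big-endian ints, which matches the buffer's default byte order):

import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedInts {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("integers.bin"), StandardOpenOption.READ)) {
            // View the mapped bytes as ints: the OS pages data in on demand
            // and no int[] copy has to live on the heap.
            IntBuffer ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
            System.out.println("first integer: " + ints.get(0));
        }
    }
}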







