Fastest way to create a trie (JSON) from a 4GB file using only 1GB of RAM?

Perhaps I am doing it wrong:

I have a 4GB file (33 million lines of text) with one word per line.

I am trying to build a trie from it. The algorithm itself works; the problem is that Node.js has a ~1.4GB process memory limit, so the process crashes around 5.5 million rows.

To get around this, I tried the following:

Instead of one trie, I create many tries, each covering a range of the alphabet. For example: aTrie -> all words starting with a, bTrie -> all words starting with b, ... etc.

But the problem is that I still can't keep all of those objects in memory while reading the file, so every time I read a line, I load/unload the corresponding trie from disk. When there is a change, I delete the old file and write the updated trie from memory back to disk.

This is SUPER SLOW, even on my MacBook Pro with an SSD.

I considered writing this in Java, but then there is the problem of converting Java objects to JSON (the same problem applies to C++, etc.).

Any suggestions?



2 answers


Instead of using 26 tries, you can use a hash function to create an arbitrary number of sub-tries. That way, the amount of data you have to read from disk is bounded by the size of a single sub-trie, which you control. You can also cache recently used sub-tries in memory and flush changes to disk asynchronously in the background if I/O is still a problem.





You can increase the memory limit of the Node process with the flag below.

Note: the size is in MB.



node --max_old_space_size=4096

      

See the V8 flags reference for more information: https://github.com/thlorenz/v8-flags/blob/master/flags-0.11.md
