Caching a tar in the JVM for faster file IO?

I am working on a Java application that uses thousands of small files to generate artifacts in response to requests. I think our system could see performance improvements if we could cache these files in memory instead of going out to disk to find them every time.

I have heard about mmap on Linux, and my basic understanding of that concept is that when a file is read from disk, its contents are cached somewhere in memory for faster subsequent access. What I have in mind is similar to that idea, except I would like to read the whole mmap-able set of files into memory while my web app is initializing, for minimal request response times.

One twist on my thinking is that we would probably get the files into JVM memory faster if they were all packed into a tar and somehow mounted in the JVM as a virtual filesystem. As it stands, it can take several minutes for our current implementation just to walk the set of source files and figure out what is on disk, because we are essentially stat-ing over 300,000 files.

I found the Apache VFS project, which can read entries from a tar file, but it isn't clear from the documentation whether you can specify something like "also, read the whole tar into memory and keep it there."

We're talking about a multi-threaded environment serving artifacts that typically combine about 100 different files out of a complete set of 300,000+ source files to produce one response. So whatever the virtual filesystem solution is, it needs to be thread-safe and performant. We're only talking about reading files here, not writing.

Also, we run a 64-bit OS with 32 GB of RAM, and our 300,000 files take up about 1.5 to 2.5 gigabytes of space. I'm confident we can read one 2.5 GB file into memory much faster than 300K small files of a few kilobytes each.

Thanks for your input!

  • Jason




8 answers


You could try putting all the files into a JAR and putting that on the classpath. Java uses some built-in tricks to make reading from a JAR file very fast. That also keeps an index of all the files in RAM, so you don't have to hit the disk just to locate a file (which happens before you can load it).

The JVM will not load the whole JAR into RAM at once, and you probably don't want it to, because your machine would start swapping. But it will be able to locate fragments quickly, because it keeps the file open the whole time, so you don't waste time opening and closing file handles.

Also, since you use this single file all the time, chances are the OS will keep it in its file cache longer.
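A minimal sketch of the classpath lookup, assuming the files have already been packed into a JAR that is on the classpath; the resource path used here is just a placeholder:

```java
import java.io.IOException;
import java.io.InputStream;

public class ClasspathFileLoader {

    /**
     * Reads a file that was packed into a JAR on the classpath.
     * The JAR's central directory stays in memory, so locating
     * the entry does not require a directory scan on disk.
     */
    public static byte[] read(String resourcePath) throws IOException {
        try (InputStream in = ClasspathFileLoader.class
                .getClassLoader()
                .getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourcePath);
            }
            return in.readAllBytes(); // Java 9+
        }
    }

    public static void main(String[] args) throws IOException {
        // "data/templates/foo.txt" is a placeholder entry name.
        byte[] content = read("data/templates/foo.txt");
        System.out.println("Read " + content.length + " bytes");
    }
}
```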



Finally, you could try compressing the JAR. While that may sound like a bad idea, it is worth trying. If the small files compress well, the time to decompress them on a current CPU is much lower than the time to read the data from disk. If you don't need to store the intermediate data anywhere, you can stream the uncompressed data to the client without writing it to a file (which would wreck the whole idea). The downside is that it does burn CPU cycles, and if your CPU is already busy (check with some load tool; if it is above 20%, you will lose), the whole process gets slower.

However, since you are serving over HTTP, you can tell the client that you are sending compressed data! That way you never need to decompress the data on the server, and you get to serve very small files.
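A rough sketch of that idea, assuming a servlet front end (javax.servlet is an assumption, not something stated in the question) and that each file's bytes are gzipped once when loaded into memory:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GzipServing {

    /** Compresses a file's bytes once, e.g. when loading it into memory at startup. */
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        }
        return buffer.toByteArray();
    }

    /** Sends the pre-compressed bytes without ever decompressing them on the server. */
    public static void send(byte[] gzipped, HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String acceptEncoding = req.getHeader("Accept-Encoding");
        if (acceptEncoding == null || !acceptEncoding.contains("gzip")) {
            // Client can't handle gzip; a real implementation would decompress here.
            resp.sendError(HttpServletResponse.SC_NOT_ACCEPTABLE);
            return;
        }
        resp.setHeader("Content-Encoding", "gzip");
        resp.setContentLength(gzipped.length);
        try (OutputStream out = resp.getOutputStream()) {
            out.write(gzipped);
        }
    }
}
```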

The main disadvantage of the JAR solution: you cannot replace the JAR while the server is running, so replacing a file means you have to restart the server.





If you have 300,000 files that you need to access quickly, you could use a database, not a relational one but a simple key-value store like http://www.space4j.org/ . That will not help your startup time, but it could speed things up while the application is running.







To clarify: mmap() on Unix-like systems won't let you access files as such; it simply makes the contents of one file available in memory, as memory. You cannot use open() to further open any contained files. There is no such thing as an "mmap()-able set of files".

Couldn't you just add a pass that loads all of your "templates" up front and then looks them up by something simple, like a hash keyed by each template's name? That would let you use your memory and get O(1) access to any template.





I think you are still thinking in the old memory-versus-disk mode. mmap won't help here, because that old memory/disk split is long gone. If you mmap a file, the kernel gives you back a pointer to some virtual memory to use as it sees fit; it does not load the file into real memory right away. It does that only when you ask for a portion of the file, and it loads only the pages you ask for (a page of memory being something around 4 KB).

You say these files number 300K and take up 1.5 to 2.5 GB of disk space. If there is any chance you can throw 2 (or better, 4) extra gigabytes of RAM at your server, you would be much better off leaving this reading business to the OS: if it has enough RAM to keep the files in its disk cache, it will, and then any read() on them won't even hit the disk. (It will still update the atime in the inode, unless you mounted the volume with noatime.)

If you read() the files yourself, stuff them into memory and serve them from there, you now have no way of knowing for sure that they will stay in RAM and not get pushed to swap, because the OS may have had other uses for that part of memory which you haven't touched in a few seconds.

If you have enough RAM for the OS to do disk caching and you really want the files pre-loaded, you can always write a little script or program that walks your hierarchy and reads every file (and does nothing else). That will force the OS to pull them from disk into its disk cache, but you cannot know they will stay there if the OS ever needs the memory. So, as I said before, you should let the OS handle this and give it enough RAM to do so.
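A minimal version of such a cache-warming pass might look like the following; the root directory /data/sources is a placeholder:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CacheWarmer {

    public static void main(String[] args) throws IOException {
        // Placeholder root; point this at the real source tree.
        Path root = Paths.get("/data/sources");
        byte[] buffer = new byte[64 * 1024];

        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                // Read and discard the contents; the point is only to
                // pull the file into the OS page cache.
                try (InputStream in = Files.newInputStream(p)) {
                    while (in.read(buffer) != -1) {
                        // discard
                    }
                } catch (IOException e) {
                    System.err.println("Skipping " + p + ": " + e.getMessage());
                }
            });
        }
    }
}
```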

You should read the Varnish architecture notes, where phk explains in his own words why what you are trying to do is better left to the OS, which will always know better than the JVM what is in RAM and what is not.





If you want quick access to all of these files, you could load them into memory, but I would not keep them as files. I would put the data into some kind of object structure (in the simplest case, just a String).

What I would do is create a service that returns the file as an object structure given whatever parameter you look it up by, and then put some kind of caching mechanism in front of that service (see the sketch below). Then everything comes down to tuning the cache. If you really need to keep everything in memory, configure the cache to use more memory; if some files are used much more than others, caching only those may be enough...
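One possible sketch of that service-plus-cache idea, using a size-bounded LRU map; the size cap, the directory and the loadFromDisk helper are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateService {

    private static final int MAX_ENTRIES = 10_000; // made-up cap; tune to taste

    // LRU cache: evicts the least recently accessed entry once the cap is hit.
    private final Map<String, String> cache = Collections.synchronizedMap(
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > MAX_ENTRIES;
                }
            });

    public String get(String name) throws IOException {
        String cached = cache.get(name);
        if (cached != null) {
            return cached;
        }
        String loaded = loadFromDisk(name);
        cache.put(name, loaded);
        return loaded;
    }

    // Hypothetical loader; the real one would build whatever object
    // structure the artifacts need rather than a plain String.
    private String loadFromDisk(String name) throws IOException {
        return new String(Files.readAllBytes(Paths.get("/data/sources", name)));
    }
}
```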

We could probably give you a better answer if we knew more about what you are trying to achieve.





Put the files on 10 different servers and, instead of serving requests directly, send the client an HTTP redirect (or equivalent) with the URL where it can find the file it wants. This spreads the load: the main server only answers quick requests, and the (large) downloads are spread across several machines.
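A toy sketch of the redirect scheme, assuming a servlet front end and a made-up list of mirror hosts; files are mapped to hosts by a simple hash of the requested path:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RedirectServlet extends HttpServlet {

    // Made-up mirror hosts; in reality these would come from configuration.
    private static final String[] MIRRORS = {
            "http://files1.example.com",
            "http://files2.example.com",
            "http://files3.example.com"
    };

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String path = req.getPathInfo(); // e.g. "/artifacts/foo.dat"
        if (path == null) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST);
            return;
        }
        // Pick a mirror deterministically from the requested path so the
        // same file is always served by the same machine.
        int index = Math.abs(path.hashCode() % MIRRORS.length);
        resp.sendRedirect(MIRRORS[index] + path);
    }
}
```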





If you are on Linux, I would try a good old RAM disk. You can stick with your existing way of doing things and simply cut your IO cost dramatically. You are not tied to the JVM heap and can still replace the content easily.

Since you mentioned VFS: it also has a RAM disk provider, but I would still try a plain RAM disk first.





You should load all of the information into a Hashtable.

Load each file using its name as the key and its contents as the value. You could get an order of magnitude faster, and simpler, than the setup you describe.
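A minimal sketch of that eager preload, using a ConcurrentHashMap (thread-safe for the read-mostly case) instead of a Hashtable, and a placeholder root directory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

public class InMemoryFileStore {

    private final Map<String, byte[]> filesByName = new ConcurrentHashMap<>();

    /** Walks the source tree once at startup and keeps every file's bytes in memory. */
    public void loadAll(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    // Key by path relative to the root, e.g. "templates/foo.txt".
                    filesByName.put(root.relativize(p).toString(), Files.readAllBytes(p));
                } catch (IOException e) {
                    throw new RuntimeException("Failed to read " + p, e);
                }
            });
        }
    }

    /** Thread-safe, O(1) read-only lookup. */
    public byte[] get(String relativeName) {
        return filesByName.get(relativeName);
    }

    public static void main(String[] args) throws IOException {
        InMemoryFileStore store = new InMemoryFileStore();
        store.loadAll(Paths.get("/data/sources")); // placeholder root
        System.out.println("Loaded " + store.filesByName.size() + " files");
    }
}
```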









