Loading a lot of files from S3
What's the fastest way to get a large number of relatively small files (10-50 KB each) from Amazon S3 in Python? (On the order of 200,000 to a million files.)
I am currently using boto to generate signed URLs and PyCURL to fetch the files one by one.
Would any type of concurrency help? The PyCurl.CurlMulti object?
I am open to all suggestions. Thanks!
In Python's case, since this is I/O bound, multiple threads will make use of the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the new multiprocessing module. Even then, you may want each process to use multiple threads. You will need to tweak the number of processes and threads a bit.
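A rough sketch of that processes-plus-threads layout, using the standard library. The names `split_into_chunks`, `fetch_chunk` and `fetch_all` are made up for illustration, and `fetch` stands for whatever function downloads one signed URL (it has to be a module-level function so it can be sent to the worker processes). The default counts are placeholders to tune, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from multiprocessing import Pool


def split_into_chunks(items, n):
    """Deal the items out round-robin into n lists, one per process."""
    items = list(items)
    return [items[i::n] for i in range(n)]


def fetch_chunk(fetch, threads, chunk):
    """Runs inside one process: download a chunk with a pool of threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map keeps results in the same order as the inputs
        return list(zip(chunk, pool.map(fetch, chunk)))


def fetch_all(urls, fetch, processes=4, threads=16):
    """Spread the URLs over several processes, each running several threads."""
    chunks = split_into_chunks(urls, processes)
    results = {}
    with Pool(processes) as pool:
        for pairs in pool.map(partial(fetch_chunk, fetch, threads), chunks):
            results.update(pairs)
    return results
```

Call `fetch_all` from under an `if __name__ == "__main__":` guard, since some platforms re-import the main module in each worker process.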
If you do use multiple threads, this is a good candidate for the Queue class.
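For example, a minimal worker pool built around a queue might look like the following. This is only a sketch: `download_all` is a hypothetical helper, and `fetch` is assumed to be any function that takes a signed URL and returns the file contents (for instance a small wrapper around PyCURL or urllib). In Python 3 the module is named `queue`; in Python 2 it was `Queue`.

```python
import queue
import threading


def download_all(urls, fetch, num_threads=20):
    """Fetch every URL with a fixed pool of threads fed from one queue."""
    tasks = queue.Queue()
    results, errors = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:          # sentinel: nothing left to do
                break
            try:
                body = fetch(url)
            except Exception as exc:  # remember the failure, keep working
                with lock:
                    errors[url] = exc
            else:
                with lock:
                    results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:                # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results, errors
```

The thread count is the knob to experiment with; since the work is mostly waiting on the network, a value well above the number of cores is a reasonable starting point.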
I've been using txaws with Twisted for S3 work, although what you most likely want is to just get an authenticated URL and use twisted.web.client.downloadPage (by default it will happily go from stream to file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000 files, I would probably create a generator and use a cooperator to set my concurrency, and just let the generator produce every download request that is needed.
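The same generator-plus-fixed-concurrency pattern can be sketched without Twisted. The example below uses the standard library's asyncio instead, purely so that it runs without extra dependencies; it is not the Twisted cooperator API. `run_bounded` is a made-up name, and `fetch` is assumed to be a coroutine function that downloads one URL.

```python
import asyncio


async def run_bounded(urls, fetch, concurrency=50):
    """Run fetch(url) for every URL with at most `concurrency` in flight."""
    results = {}
    jobs = iter(urls)  # one shared iterator; works with a lazy generator too

    async def worker():
        # Each worker pulls the next URL only when its previous
        # download has finished, so the generator is consumed lazily.
        for url in jobs:
            results[url] = await fetch(url)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results
```

Because the workers share one iterator, the concurrency level is simply the number of workers, and the full list of requests never has to exist in memory at once.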
If you are not familiar with Twisted, you will find that the model takes a little time to get used to, but it is well worth it. In this case, I would expect it to need minimal CPU and memory overhead, but you do have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines if you find you need more file descriptors, or if you have multiple connections over which you would like to pull the files down.
What about threads plus a queue? I like this article: Practical Programming Using Python
Every job can be done with the appropriate tools :)
You want to use Python to stress test S3 :), so I suggest finding a bulk downloader program and passing the links to it.
On Windows, I have experience installing ReGet (shareware, from http://reget.com ) and creating download tasks via its COM interface.
Of course, there may be other programs with a usable interface.
Regards!