Downloading a large number of files from S3

What's the fastest way to get a large number of files (relatively small, 10-50 KB each) from Amazon S3 in Python? (On the order of 200,000 files.)

I am currently using boto to generate signed URLs and PyCURL to fetch the files one by one.

Would some form of concurrency help? The PyCurl.CurlMulti object?

I am open to all suggestions. Thanks!

+2




6 answers


In Python's case, since this is I/O bound, multiple threads will use the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the new multiprocessing module. Even then, you may want each process to use multiple threads. You will have to tweak the number of processes and threads a bit.

If you do use multiple threads, this is a good candidate for the Queue class.
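
For instance, here is a minimal sketch of the threads-plus-Queue pattern using only the standard library; SIGNED_URLS and NUM_THREADS are illustrative placeholders, and each entry is assumed to be a pre-signed URL paired with a local filename:

    # A sketch of worker threads draining a Queue of downloads;
    # SIGNED_URLS and NUM_THREADS are illustrative placeholders.
    import queue
    import threading
    import urllib.request

    NUM_THREADS = 20
    SIGNED_URLS = []  # fill in: (url, local_filename) pairs

    work = queue.Queue()
    for item in SIGNED_URLS:
        work.put(item)

    def worker():
        while True:
            try:
                url, filename = work.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            urllib.request.urlretrieve(url, filename)  # one GET per file

    threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()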

+1




I don't know anything about Python, but in general you would want to split the task into smaller chunks so they can run at the same time. You could split by file type, or alphabetically, or whatever, and then run a separate script for each part of the split, as in the sketch below.
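
For example, a sketch of splitting a key manifest into chunks and launching one worker process per chunk; "keys.txt" and "download_worker.py" are hypothetical names for a key list and a per-chunk download script:

    # A sketch of splitting the work and running the parts in parallel;
    # "keys.txt" and "download_worker.py" are hypothetical names.
    import subprocess

    with open("keys.txt") as f:            # one S3 key per line (assumed)
        keys = [line.strip() for line in f]

    n_chunks = 8
    chunks = [keys[i::n_chunks] for i in range(n_chunks)]  # round-robin split

    procs = []
    for i, chunk in enumerate(chunks):
        part = "keys_part%d.txt" % i
        with open(part, "w") as f:
            f.write("\n".join(chunk))
        # Each worker script downloads the keys listed in its part file.
        procs.append(subprocess.Popen(["python", "download_worker.py", part]))

    for p in procs:
        p.wait()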



+2




You can use s3fs and just run parallel filesystem commands from Python.
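
A minimal sketch of that approach, assuming the s3fs Python package (the fsspec-based one) rather than the FUSE tool of the same name; the bucket, prefix, and download directory are placeholder names:

    # A sketch using the s3fs package plus a thread pool; "my-bucket",
    # "my-prefix", and "downloads" are hypothetical placeholder names.
    from concurrent.futures import ThreadPoolExecutor
    import os
    import s3fs

    fs = s3fs.S3FileSystem(anon=False)   # picks up your AWS credentials
    keys = fs.ls("my-bucket/my-prefix")  # list the objects to fetch

    os.makedirs("downloads", exist_ok=True)

    def fetch(key):
        # Copy one S3 object to a local file with the same base name.
        fs.get(key, os.path.join("downloads", os.path.basename(key)))

    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(fetch, keys))  # drain the iterator to run every job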

+1




I'm using txAWS with Twisted for S3 work, although what you most likely want is to just get an authenticated URL and use twisted.web.client.downloadPage (which will happily stream straight to a file without much interaction by default).

Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I would probably create a generator and use a cooperator to set my concurrency, and just let the generator yield every download request it requires.

If you are not familiar with Twisted, you will find the model takes a little time to get used to, but it is well worth it. In this case, I would expect it to add minimal CPU and memory overhead, but you would have to worry about file descriptors. It is quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find you need more file descriptors, or if you have multiple connections over which you'd like it to pull the files down.
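
A minimal sketch of that cooperator pattern; SIGNED_URLS is a placeholder list of (pre-signed URL, local filename) pairs and the concurrency cap is arbitrary:

    # A sketch of capped-concurrency downloads with Twisted's Cooperator;
    # SIGNED_URLS and CONCURRENCY are illustrative placeholders.
    from twisted.internet import defer, reactor, task
    from twisted.web.client import downloadPage

    SIGNED_URLS = []   # fill in: (url, local_filename) pairs
    CONCURRENCY = 100  # how many downloads may be in flight at once

    def download_all():
        coop = task.Cooperator()
        # One shared generator: each cooperative task pulls the next
        # Deferred and waits for it to fire before pulling another.
        work = (downloadPage(url.encode("ascii"), filename)
                for url, filename in SIGNED_URLS)
        done = defer.DeferredList([coop.coiterate(work)
                                   for _ in range(CONCURRENCY)])
        done.addBoth(lambda _: reactor.stop())

    reactor.callWhenRunning(download_all)
    reactor.run()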

0




What about threads + a queue? I like this article: Practical Programming Using Python.

0




Every task is best done with the appropriate tool :)

You want to use Python while putting heavy load on S3 :), so I suggest finding a bulk-download program and passing it the links.

On Windows, I have experience installing ReGet (shareware, from http://reget.com) and creating download tasks via its COM interface.

Of course, there may be other download managers with a similar programmable interface.

Regards!

0








