Python Data Pipelines and Streaming
My work involves processing and streaming large volumes of data from many different sources. I use Python for everything, and I'm wondering which areas of Python I should explore to build and optimize batch pipelines. I know there are open-source options like Luigi (created by Spotify), but I think that's overkill for my needs. So far I only know to look into generators and lazy evaluation, but what other concepts and libraries could I use to batch-process data efficiently in Python? One example script would read a ton of formatted JSON files and convert them to CSV before loading them into the database, using as little memory as possible. (I need to use a standard SQL database, not NoSQL.) Any advice would be greatly appreciated.
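As a starting point for the generator/lazy-evaluation idea, here is a minimal sketch of that example workflow: each stage is a generator, so only one record (plus one insert batch) is in memory at a time. The table name, column names, and file layout are hypothetical stand-ins for your actual schema, and `sqlite3` stands in for whatever standard SQL database you use (any DB-API driver with `executemany` would look the same). The demo writes two tiny JSON files just so the script is self-contained.

```python
import glob
import json
import os
import sqlite3
import tempfile

def read_records(paths):
    """Lazily yield one parsed record per JSON file; only one file is open at a time."""
    for path in paths:
        with open(path) as f:
            yield json.load(f)

def to_rows(records, fields):
    """Flatten each record dict into a tuple of column values."""
    for rec in records:
        yield tuple(rec.get(name) for name in fields)

def batched(rows, size):
    """Group rows into lists of at most `size` for bulk inserts."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Demo: two small files standing in for "a ton" of JSON files.
tmp = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmp, f"{i}.json"), "w") as f:
        json.dump({"id": i, "name": f"item{i}", "value": i * 1.5}, f)

fields = ("id", "name", "value")      # hypothetical schema
conn = sqlite3.connect(":memory:")    # stand-in for your SQL database
conn.execute("CREATE TABLE items (id INTEGER, name TEXT, value REAL)")

rows = to_rows(read_records(sorted(glob.glob(os.path.join(tmp, "*.json")))), fields)
for batch in batched(rows, 1000):
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", batch)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

The same row tuples could be streamed through `csv.writer.writerows` if you need the intermediate CSV on disk; because every stage is a generator, peak memory stays proportional to the batch size, not the dataset.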
The example you mentioned, reading a lot of files, translating them, and then populating a database, reminds me of a signal-processing application I wrote.
My application (http://github.com/vmlaker/sherlock) processes large chunks of data (images) in parallel on multi-core processors. I used two modules for a clean implementation: MPipe for building a multi-stage parallel pipeline, and numpy-sharedmem for sharing NumPy arrays between processes.
If you're trying to maximize runtime performance and have multiple cores available, you could build a similar pipeline for the example you give:
Read File → Translate → Update Database
Reading the JSON files is I/O-bound, but multiprocessing can speed up the translation step, and potentially the database updates as well.