Clojure: parallel processing using multiple computers

I have 500 directories, with 1000 files (each about 3-4k lines) in each directory. I want to run the same Clojure program (already written) on each of these files. I have 4 octa-core servers. What is a good way to distribute the work across all of these cores? Cascalog (Hadoop + Clojure)?

Basically, the program reads a file, uses a third-party Java jar to do the calculations, and inserts the results into the DB.

Please note that: 1. the ability to use third-party libraries/jars is a requirement, and 2. there is no querying of any kind.
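
For reference, the per-file job looks roughly like this (a minimal sketch; `run-calculation` and `insert-results!` are hypothetical stand-ins for the third-party jar call and the DB insert):

```clojure
(ns worker.core
  (:require [clojure.java.io :as io]))

(defn run-calculation
  "Hypothetical stand-in for the third-party jar's entry point."
  [lines]
  ;; e.g. (SomeLib/compute (into-array String lines))
  lines)

(defn insert-results!
  "Hypothetical stand-in for the DB insert (e.g. via clojure.java.jdbc)."
  [results]
  nil)

(defn process-file!
  "Read one file, run the calculation, store the results."
  [f]
  (with-open [rdr (io/reader f)]
    (-> (line-seq rdr)
        doall                 ; realize all lines before the reader closes
        run-calculation
        insert-results!)))
```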



2 answers


Since there is no "reduce" step in your overall process, as I understand it, it makes sense to simply put 125 directories on each server and then spend the rest of your time trying to make that program run faster. Up to the point where you saturate the DB, of course.
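
A minimal sketch of that single-server part, assuming a per-file `process-file!` like the one sketched in the question (the namespace and the fixed-size thread pool are my own choices, not anything prescribed):

```clojure
(ns worker.parallel
  (:require [clojure.java.io :as io]
            [worker.core :refer [process-file!]])
  (:import [java.util.concurrent Executors TimeUnit]))

(defn process-dirs!
  "Process every file under dirs using a fixed pool of n worker threads."
  [dirs n]
  (let [pool  (Executors/newFixedThreadPool n)
        files (for [d dirs
                    ^java.io.File f (file-seq (io/file d))
                    :when (.isFile f)]
                f)]
    (doseq [f files]
      (.submit pool ^Runnable (fn [] (process-file! f))))
    (.shutdown pool)
    ;; block until every submitted job has finished
    (.awaitTermination pool Long/MAX_VALUE TimeUnit/SECONDS)))
```

Something like `(process-dirs! my-125-dirs 8)` on each octa-core box keeps all cores busy until the DB becomes the bottleneck.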



Most of the available "big data" tools (Hadoop, Storm) focus on processes that need both very powerful map and reduce operations, possibly with several stages of each. In your case, all you really need is a decent way of keeping track of which jobs succeeded and which failed. I'm as bad as anyone (and worse than many) at estimating development time, though in this case I'd say it's even odds that rewriting your process on one of the map-reduce tools would take more time than adding a monitoring process to keep track of which jobs completed and which failed, so you can rerun the failed ones later (preferably automatically).
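
A sketch of that monitoring idea, again assuming the hypothetical `process-file!` from the question; the log file names and tab-separated format are arbitrary:

```clojure
(ns worker.tracking
  (:require [clojure.java.io :as io]
            [clojure.string :as str]
            [worker.core :refer [process-file!]]))

;; Serialize log writes through an agent so parallel workers don't interleave lines.
(def ^:private logger (agent nil))

(defn- log-line! [path line]
  (send-off logger (fn [_] (spit path (str line "\n") :append true))))

(defn run-tracked!
  "Run the per-file job, recording the outcome; returns true on success."
  [^java.io.File f]
  (try
    (process-file! f)
    (log-line! "done.log" (.getPath f))
    true
    (catch Exception e
      (log-line! "failed.log" (str (.getPath f) "\t" (.getMessage e)))
      false)))

(defn retry-failed!
  "Re-run every file recorded in failed.log."
  []
  (let [lines (with-open [rdr (io/reader "failed.log")]
                (doall (line-seq rdr)))]    ; snapshot the log before re-running
    (doseq [line lines]
      (run-tracked! (io/file (first (str/split line #"\t")))))))
```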



Onyx is a recent pure-Clojure alternative to Hadoop/Storm. As long as you are comfortable with Clojure, working with Onyx is fairly straightforward. You could try this data-driven approach:



https://github.com/MichaelDrogalis/onyx
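
To give a feel for the data-driven style: an Onyx job is described with plain Clojure data structures, like the workflow and catalog below. The key names follow the Onyx documentation, but the task names and `my.app/calculate` are hypothetical:

```clojure
;; Workflow: a vector of [from to] task pairs forming the dataflow graph.
(def workflow
  [[:read-file :calculate]
   [:calculate :write-db]])

;; Catalog: one map per task describing what it does.
(def catalog
  [{:onyx/name :calculate
    :onyx/fn :my.app/calculate   ; would wrap the third-party jar
    :onyx/type :function
    :onyx/batch-size 20}
   ;; :read-file and :write-db would be input/output tasks backed by
   ;; Onyx plugins (e.g. a file reader and a SQL writer).
   ])
```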







