Importing modules for code that runs on the workers

I wrote a simple task that filters an RDD using a custom function that relies on a module.

Where is the correct place to put the import statement?

  • placing the import in the driver code doesn't help
  • doing the import inside the filter function works, but it doesn't look very clean (see the sketch below)
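
For concreteness, here is a minimal sketch of the setup being described; the module name other_module and its is_valid helper are hypothetical placeholders, not code from the actual project:

# Sketch only: other_module and is_valid are hypothetical names.
from pyspark import SparkContext

sc = SparkContext(appName="filter_example")

def keep_record(record):
    # This is the second variant from the list above: the import statement
    # runs on the worker process, so it works as long as the module is
    # importable on the worker nodes.
    import other_module
    return other_module.is_valid(record)

# A top-level "import other_module" in the driver script alone does not make
# the module available to the executors, which is the problem described above.
filtered = sc.parallelize(["a", "b", "c"]).filter(keep_record)
print(filtered.count())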




1 answer


You can submit jobs as batch operations, together with their dependent modules, using the spark-submit command line interface. From the Spark 1.6.1 documentation, it has the following signature:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

If your Python script is called python_job.py and the module it depends on is other_module.py, you call:



./bin/spark-submit --py-files other_module.py python_job.py

This ensures that other_module.py is available on the worker nodes. More often than not, you will want to ship a whole package, such as an other_module_library.egg or even a .zip file; all of these are accepted by --py-files.
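
For illustration, a sketch of what python_job.py could look like once other_module.py is shipped with --py-files; the is_valid helper is a placeholder. A plain top-level import is then enough, because the shipped files are added to the executors' Python path:

# python_job.py -- sketch only; other_module.is_valid is a hypothetical helper.
from pyspark import SparkContext

import other_module  # importable on the executors because it was passed via --py-files

sc = SparkContext(appName="python_job")
rdd = sc.parallelize(["a", "b", "c"])
print(rdd.filter(other_module.is_valid).count())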

If you want to work in an interactive shell, I believe you will need to import the module inside the function itself.
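
A sketch of that interactive pattern, again with hypothetical names (other_module, is_valid); as an alternative for a running shell, SparkContext.addPyFile can ship a single file to the executors:

# Inside a pyspark shell, where sc is the SparkContext created for you.
def keep_record(record):
    import other_module  # hypothetical module; imported on the worker at call time
    return other_module.is_valid(record)

print(sc.parallelize(["a", "b", "c"]).filter(keep_record).count())

# Alternatively, ship the file to the executors from the running shell:
# sc.addPyFile("other_module.py")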
