Importing modules for code that runs on the workers

I wrote a simple task that filters an RDD using a custom function that relies on a module.

Where is the correct place to put the import statement?

  • placing the import in the driver code doesn't help
  • doing the import inside the filter function works, but it doesn't look very clean (see the sketch below)
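
For concreteness, here is a minimal sketch of the setup being described; the module name other_module and its is_valid helper are hypothetical placeholders, not code from the actual project:

# Sketch only: other_module and is_valid are hypothetical names.
from pyspark import SparkContext

sc = SparkContext(appName="filter_example")

def keep_record(record):
    # This is the second variant from the list above: the import statement
    # runs on the worker process, so it works as long as the module is
    # importable on the worker nodes.
    import other_module
    return other_module.is_valid(record)

# A top-level "import other_module" in the driver script alone does not make
# the module available to the executors, which is the problem described above.
filtered = sc.parallelize(["a", "b", "c"]).filter(keep_record)
print(filtered.count())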




1 answer


You can submit jobs as batch operations, together with their dependent modules, using the spark-submit command line interface. From the Spark 1.6.1 documentation, it has the following signature:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

If your Python script is called python_job.py and the module it depends on is other_module.py, you call:



./bin/spark-submit --py-files other_module.py python_job.py

This ensures that other_module.py is available on the worker nodes. More often than not, you will want to ship a whole package, such as an other_module_library.egg or even a .zip file; all of these are accepted by --py-files.
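
For illustration, a sketch of what python_job.py could look like once other_module.py is shipped with --py-files; the is_valid helper is a placeholder. A plain top-level import is then enough, because the shipped files are added to the executors' Python path:

# python_job.py -- sketch only; other_module.is_valid is a hypothetical helper.
from pyspark import SparkContext

import other_module  # importable on the executors because it was passed via --py-files

sc = SparkContext(appName="python_job")
rdd = sc.parallelize(["a", "b", "c"])
print(rdd.filter(other_module.is_valid).count())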

If you want to work in an interactive shell, I believe you will need to import the module inside the function itself.
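
A sketch of that interactive pattern, again with hypothetical names (other_module, is_valid); as an alternative for a running shell, SparkContext.addPyFile can ship a single file to the executors:

# Inside a pyspark shell, where sc is the SparkContext created for you.
def keep_record(record):
    import other_module  # hypothetical module; imported on the worker at call time
    return other_module.is_valid(record)

print(sc.parallelize(["a", "b", "c"]).filter(keep_record).count())

# Alternatively, ship the file to the executors from the running shell:
# sc.addPyFile("other_module.py")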
