Spark on YARN - Working with Spark from Django

I am developing a web application with the following components:

  • Apache Spark running on a 3-node cluster (Spark 1.4.0, Hadoop 2.4, and YARN)
  • A Django web application server

The Django app will create Spark jobs "on demand"; they may run in parallel, depending on how many users are using the application.

I would like to know if there is a way to submit Spark jobs from Python code in Django. Can PySpark be integrated into Django, or can I access the YARN API directly to submit jobs?

I know I can submit jobs to the cluster with the spark-submit script, but I am trying to avoid that, because it means executing a shell command from application code, which is not very safe.
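For context, what I am trying to avoid looks roughly like this (a minimal sketch; the script path is hypothetical):

    # Sketch of the shell-out approach I want to avoid; the script path is hypothetical.
    import subprocess

    subprocess.check_call([
        "spark-submit",
        "--master", "yarn-cluster",  # YARN cluster mode in Spark 1.4
        "/path/to/my_job.py",
    ])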

Any help would be much appreciated.

Many thanks,

SOUTH.


2 answers


Partial, untested answer: Django is a web framework, so it is poorly suited to managing long-running jobs (more than about 30 seconds), which is probably what your Spark jobs are.

So you need an asynchronous job queue such as Celery. It's a bit of a pain to set up (not that bad, but still), and I would suggest you start with this.



Then you will have:

  • Django to start / monitor jobs
  • a RabbitMQ / Celery asynchronous job queue
  • custom Celery tasks that use PySpark and launch the Spark jobs (a sketch follows below)
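A minimal sketch of such a Celery task, assuming Celery is configured against a RabbitMQ broker and that SPARK_HOME and the YARN configuration are visible to the worker process (the broker URL, app name, and job logic are hypothetical):

    # tasks.py -- minimal sketch; broker URL, app name, and job logic are hypothetical.
    from celery import Celery

    app = Celery("spark_tasks", broker="amqp://guest@localhost//")

    @app.task
    def run_word_count(input_path):
        # Import inside the task so the Django process itself never loads PySpark.
        from pyspark import SparkConf, SparkContext

        conf = SparkConf().setAppName("django-word-count").setMaster("yarn-client")
        sc = SparkContext(conf=conf)
        try:
            return (sc.textFile(input_path)
                      .flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b)
                      .take(20))
        finally:
            sc.stop()

A Django view then only calls run_word_count.delay(some_path) and checks the task result later, so the HTTP request/response cycle itself stays short.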

There is a project on GitHub called the Ooyala Spark Job Server: https://github.com/ooyala/spark-jobserver.

It allows Spark jobs to be submitted to YARN via HTTP requests.
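A rough sketch of what a submission could look like, assuming the job server listens on localhost:8090 and an application jar has already been uploaded under the name my-app with a job class spark.jobserver.WordCountExample (all of these names are placeholders; check the project's README for the exact endpoints):

    # Sketch: submit a job through the job server's REST API and poll its status.
    # Host, port, appName, and classPath below are placeholders.
    import requests

    BASE = "http://localhost:8090"

    # Asynchronous submission; the server responds with a jobId.
    resp = requests.post(
        BASE + "/jobs",
        params={"appName": "my-app",
                "classPath": "spark.jobserver.WordCountExample"},
        data="input.string = a b c a b",  # job configuration passed in the body
    )
    job_id = resp.json()["result"]["jobId"]

    # Poll the job until it finishes.
    print(requests.get(BASE + "/jobs/" + job_id).json())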



In Spark 1.4.0+, support was also added for monitoring job status via HTTP requests.
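A minimal sketch of querying that monitoring REST API, assuming the application UI is reachable on localhost:4040 (with YARN the URL normally goes through the ResourceManager's proxy instead):

    # Sketch: read application and job status from Spark's monitoring REST API (1.4+).
    import requests

    base = "http://localhost:4040/api/v1"

    for app in requests.get(base + "/applications").json():
        for job in requests.get(base + "/applications/" + app["id"] + "/jobs").json():
            print(app["id"], job["jobId"], job["status"])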
