Automatic shutdown of Google Dataproc cluster after completion of all jobs

How can I programmatically shut down a Google Dataproc cluster automatically after all jobs have finished?

Dataproc provides APIs for creation, monitoring and management (https://cloud.google.com/dataproc/docs/resources/faq), but I cannot figure out how to delete the cluster.



4 answers


You can do this with Scala code:

- create the cluster
- run all your jobs
- when the jobs are complete, delete the cluster

You can use Scala Futures for this. If you have many jobs, you can run them in parallel:



import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try
import ammonite.ops._

val gcpJarBucket = "gs://test_dataproc/dataproc/Dataproc.jar"
val jobs = Seq("package.class1", "package.class2")
val projectName: String = "automat-dataproc"
val clusterName: String = "your-cluster-name"

val timeout = 180.minutes

// Working directory for Ammonite's % shell operator
implicit val wd = pwd

val future = Future {
  println("Creating the spark cluster...")
  % gcloud("dataproc", "clusters", "create", clusterName, "--subnet", "default", "--zone", "europe-west1-b", "--master-machine-type", "n1-standard-4", "--master-boot-disk-size", "50", "--num-workers", "3", "--worker-machine-type", "n1-standard-4", "--worker-boot-disk-size", "50", "--project", projectName)
  println("Creating the spark cluster...DONE")
}.flatMap { _ =>
  // Once the cluster is up, submit all jobs in parallel
  Future.sequence {
    jobs.map { jobClass =>
      Future {
        println(s"Launching the spark job from the class $jobClass...")
        % gcloud("dataproc", "jobs", "submit", "spark", s"--cluster=$clusterName", s"--class=$jobClass", "--region=global", s"--jars=$gcpJarBucket")
        println(s"Launching the spark job from the class $jobClass...DONE")
      }
    }
  }
}

// Wait for all jobs to finish (or time out), then delete the cluster,
// piping "Y" to answer the deletion confirmation prompt
Try { Await.ready(future, timeout) }.recover { case exp => println(exp) }
% bash("-c", s"printf 'Y\\n' | gcloud dataproc clusters delete $clusterName")

      



There are several programmatic ways to shut down a cluster automatically:

- call clusters.delete through the Dataproc REST API
- run gcloud dataproc clusters delete from the gcloud CLI
- use one of the Google Cloud client libraries

Any of them can be called after your jobs have finished.

More details here: https://cloud.google.com/dataproc/docs/guides/manage-cluster#delete_a_cluster
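
For example, a minimal sketch of the client-library route in Python, assuming the google-cloud-dataproc package (v2+) is installed; the project, region, and cluster names below are placeholders:

from google.cloud import dataproc_v1

# Placeholder values -- substitute your own
project_id = "my-project"
region = "europe-west1"
cluster_name = "my-cluster"

# Regional clusters are reached through a regional API endpoint
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# delete_cluster returns a long-running operation; result() blocks until it completes
operation = client.delete_cluster(
    project_id=project_id, region=region, cluster_name=cluster_name
)
operation.result()
print(f"Cluster {cluster_name} deleted.")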



It depends on the language. Personally, I use Python (pyspark) and the code given here worked fine for me:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/dataproc/submit_job_to_cluster.py

You may need to adapt the code to your goal and follow the preliminary steps given in the README file (https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/dataproc), for example enabling the API and installing the packages from requirements.txt.

In short, wait_for_job blocks until the submitted job completes, and delete_cluster, as the name suggests, deletes the previously created cluster. I hope this helps you.
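
A condensed sketch of that flow (the helper names mirror the sample, but the project, region, cluster, and job values here are placeholders):

import time
from googleapiclient import discovery

project, region, cluster = "my-project", "global", "my-cluster"
dataproc = discovery.build("dataproc", "v1")

def wait_for_job(job_id):
    # Poll the job until it reaches a terminal state
    while True:
        result = dataproc.projects().regions().jobs().get(
            projectId=project, region=region, jobId=job_id).execute()
        state = result["status"]["state"]
        if state == "ERROR":
            raise Exception(result["status"].get("details", "Job failed"))
        if state == "DONE":
            return result
        time.sleep(5)

def delete_cluster():
    # Tear down the previously created cluster
    return dataproc.projects().regions().clusters().delete(
        projectId=project, region=region, clusterName=cluster).execute()

The job_id comes from the response to the job-submission request, so the overall flow is: submit the job, wait_for_job, then delete_cluster.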



The gcloud beta dataproc CLI offers a "max-idle" option, which automatically deletes the Dataproc cluster after a given period of inactivity (that is, when no jobs are running). It can be used like this:

gcloud beta dataproc clusters create test-cluster \
    --project my-test-project \
    --zone europe-west1-b \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 100 \
    --num-workers 2 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 100 \
    --max-idle 1h







