How can I add a SparkListener from PySpark in Python?
I want to create a Jupyter / IPython extension for monitoring Apache Spark Jobs.
Spark provides REST APIs.
However, instead of polling the server, I want event updates to be sent via callbacks.
I am trying to register a SparkListener with SparkContext.addSparkListener(). This method is not exposed on the PySpark SparkContext object in Python. So how can I register a Python listener on the Scala/Java version of the context from Python? Can this be done through py4j? I want Python functions to be called when events are fired in a listener.
It is possible, although it is a bit involved. We can use Py4j's callback mechanism to pass messages from a SparkListener to Python. First, let's create a Scala package with all the required classes. Directory structure:
.
├── build.sbt
└── src
    └── main
        └── scala
            └── net
                └── zero323
                    └── spark
                        └── examples
                            └── listener
                                ├── Listener.scala
                                ├── Manager.scala
                                └── TaskListener.scala
build.sbt:
name := "listener"

organization := "net.zero323"

scalaVersion := "2.11.7"

val sparkVersion = "2.1.0"

libraryDependencies ++= List(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "net.sf.py4j" % "py4j" % "0.10.4" // Just for the record
)
Listener.scala defines the interface that we are going to implement later in Python:
package net.zero323.spark.examples.listener

/* You can add arbitrary methods here,
 * as long as these match corresponding Python interface
 */
trait Listener {
  /* This will be implemented by a Python class.
   * You can of course use more specific types,
   * for example here String => Unit */
  def notify(x: Any): Any
}
Manager.scala will be used to forward messages to the Python listeners:
package net.zero323.spark.examples.listener

object Manager {

  var listeners: Map[String, Listener] = Map()

  def register(listener: Listener): String = {
    this.synchronized {
      val uuid = java.util.UUID.randomUUID().toString
      listeners = listeners + (uuid -> listener)
      uuid
    }
  }

  def unregister(uuid: String) = {
    this.synchronized {
      listeners = listeners - uuid
    }
  }

  def notifyAll(message: String): Unit = {
    for { (_, listener) <- listeners } listener.notify(message)
  }
}
Finally, a simple SparkListener:
package net.zero323.spark.examples.listener

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

/* A simple listener which captures SparkListenerTaskEnd,
 * extracts numbers of records written by the task
 * and converts to JSON. You can of course add handlers
 * for other events as well.
 */
class PythonNotifyListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    val recordsWritten = taskEnd.taskMetrics.outputMetrics.recordsWritten
    val message = compact(render(
      ("recordsWritten" -> recordsWritten)
    ))
    Manager.notifyAll(message)
  }
}
Next, let's package our extension:
sbt package
and start a PySpark session, adding the generated jar to the class path and registering the listener:
$SPARK_HOME/bin/pyspark \
--driver-class-path target/scala-2.11/listener_2.11-0.1-SNAPSHOT.jar \
--conf spark.extraListeners=net.zero323.spark.examples.listener.PythonNotifyListener
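For reference, the same configuration can be passed programmatically when building the session yourself. A minimal sketch, assuming the jar path above; note that spark.driver.extraClassPath is only honored while the driver JVM is being launched, so this applies to a plain Python script, not to an already-running shell:

from pyspark.sql import SparkSession

# Sketch: equivalent configuration applied at session creation.
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath",
            "target/scala-2.11/listener_2.11-0.1-SNAPSHOT.jar")
    .config("spark.extraListeners",
            "net.zero323.spark.examples.listener.PythonNotifyListener")
    .getOrCreate()
)
sc = spark.sparkContext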
Next, we have to define a Python object which implements the Listener interface:
from pyspark import SparkContext  # not needed inside the pyspark shell

class PythonListener(object):
    package = "net.zero323.spark.examples.listener"

    @staticmethod
    def get_manager():
        jvm = SparkContext.getOrCreate()._jvm
        manager = getattr(jvm, "{}.{}".format(PythonListener.package, "Manager"))
        return manager

    def __init__(self):
        self.uuid = None

    def notify(self, obj):
        """This method is required by the Scala Listener interface
        we defined above.
        """
        print(obj)

    def register(self):
        manager = PythonListener.get_manager()
        self.uuid = manager.register(self)
        return self.uuid

    def unregister(self):
        manager = PythonListener.get_manager()
        manager.unregister(self.uuid)
        self.uuid = None

    class Java:
        implements = ["net.zero323.spark.examples.listener.Listener"]
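Since the Scala side sends a compact JSON string, notify can do more than print the raw payload. A minimal sketch (the subclass name JsonTaskEndListener is made up here) that decodes the message sent by PythonNotifyListener:

import json

class JsonTaskEndListener(PythonListener):
    def notify(self, obj):
        # obj is the JSON string produced on the Scala side,
        # e.g. {"recordsWritten":33}
        data = json.loads(obj)
        print("records written: {}".format(data.get("recordsWritten")))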
start callback server:
sc._gateway.start_callback_server()
create a listener:
listener = PythonListener()
register it:
listener.register()
and test it (100 elements over 3 partitions, hence the 33/34/33 record counts):
>>> sc.parallelize(range(100), 3).saveAsTextFile("/tmp/listener_test")
{"recordsWritten":33}
{"recordsWritten":34}
{"recordsWritten":33}
On exit, you must close the callback server:
sc._gateway.shutdown_callback_server()
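Tying the two lifecycles together, a minimal sketch, using the PythonListener class above, that guarantees the server is shut down even if the job fails:

sc._gateway.start_callback_server()
listener = PythonListener()
listener.register()
try:
    sc.parallelize(range(100), 3).saveAsTextFile("/tmp/listener_test")
finally:
    listener.unregister()
    sc._gateway.shutdown_callback_server()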
Note: This should be used with caution when working with Spark streaming, which internally uses the callback server.
Edit: If the Scala package is a problem, you can simply implement org.apache.spark.scheduler.SparkListenerInterface directly in Python:
class SparkListener(object):

    def onApplicationEnd(self, applicationEnd):
        pass
    def onApplicationStart(self, applicationStart):
        pass
    def onBlockManagerRemoved(self, blockManagerRemoved):
        pass
    def onBlockUpdated(self, blockUpdated):
        pass
    def onEnvironmentUpdate(self, environmentUpdate):
        pass
    def onExecutorAdded(self, executorAdded):
        pass
    def onExecutorMetricsUpdate(self, executorMetricsUpdate):
        pass
    def onExecutorRemoved(self, executorRemoved):
        pass
    def onJobEnd(self, jobEnd):
        pass
    def onJobStart(self, jobStart):
        pass
    def onOtherEvent(self, event):
        pass
    def onStageCompleted(self, stageCompleted):
        pass
    def onStageSubmitted(self, stageSubmitted):
        pass
    def onTaskEnd(self, taskEnd):
        pass
    def onTaskGettingResult(self, taskGettingResult):
        pass
    def onTaskStart(self, taskStart):
        pass
    def onUnpersistRDD(self, unpersistRDD):
        pass

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]
extend it:
class TaskEndListener(SparkListener):
    def onTaskEnd(self, taskEnd):
        print(taskEnd.toString())
and use it directly:
>>> sc._gateway.start_callback_server()
True
>>> listener = TaskEndListener()
>>> sc._jsc.sc().addSparkListener(listener)
>>> sc.parallelize(range(100), 3).saveAsTextFile("/tmp/listener_test_simple")
SparkListenerTaskEnd(0,0,ResultTask,Success,org.apache.spark.scheduler.TaskInfo@9e7514a,org.apache.spark.executor.TaskMetrics@51b8ba92)
SparkListenerTaskEnd(0,0,ResultTask,Success,org.apache.spark.scheduler.TaskInfo@71278a44,org.apache.spark.executor.TaskMetrics@bdc06d)
SparkListenerTaskEnd(0,0,ResultTask,Success,org.apache.spark.scheduler.TaskInfo@336)
While simpler, this method is not selective (more traffic between the JVM and Python) and requires handling Java objects inside the Python session.
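If you go this route, you can keep the handlers selective yourself by extracting only the fields you need. A minimal sketch, assuming the Scala case-class accessors of SparkListenerTaskEnd are invoked as methods through Py4j:

class SelectiveTaskEndListener(SparkListener):
    def onTaskEnd(self, taskEnd):
        # Case-class fields of the Java event object are called
        # as methods through Py4j.
        print("stage {}: {} records written".format(
            taskEnd.stageId(),
            taskEnd.taskMetrics().outputMetrics().recordsWritten()))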