PySpark reads caffe models from HDFS

Question

I am using the caffe library to run image detection with PySpark. I can run the Spark program in local mode, where the model is present on the local filesystem.

But I don't know how to deploy it correctly in cluster mode. I've tried the following approach:

1. Add the file to HDFS and register it with sc.addFile (or --files when submitting the job):

    sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")

2. Read the model on each worker node with:

    model_weight = SparkFiles.get('test.caffemodel')
    net = caffe.Net(model_define, model_weight, caffe.TEST)
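For illustration, the two steps above can be combined into one per-partition function so that the network is built once per partition rather than once per record. This is only a sketch under assumptions: `run_partition`, `score`, `resolve`, and `build_net` are hypothetical names that are not part of my job, and `score(net, row)` stands in for whatever forward pass the model needs. The `resolve` and `build_net` hooks default to `SparkFiles.get` and `caffe.Net` and exist mainly so the function can be tried without a cluster.

```python
def run_partition(rows, model_define, weight_name, score,
                  resolve=None, build_net=None):
    """Build the net once for this partition, then score each row lazily.

    `score`, `resolve` and `build_net` are hypothetical hooks (assumptions).
    """
    if resolve is None:
        # Imported here so the import happens on the executor.
        from pyspark import SparkFiles
        resolve = SparkFiles.get
    if build_net is None:
        import caffe

        def build_net(proto, weights):
            return caffe.Net(proto, weights, caffe.TEST)

    # SparkFiles.get returns a path on the worker's local disk, not on HDFS.
    net = build_net(model_define, resolve(weight_name))
    for row in rows:
        yield score(net, row)
```

On the driver this would be used roughly as `rdd.mapPartitions(lambda rows: run_partition(rows, model_define, 'test.caffemodel', score))`, assuming `model_define` is also readable on every worker.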

Since SparkFiles.get() returns the local file location on the worker node (not an HDFS path), I can restore my model using the path it returns. This approach works fine in local mode; in distributed mode, however, it results in the following error:

ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V

It looks like the data is too big to be shuffled, as discussed in "Apache Spark: Networking Errors Between Executors". However, the model size is only about 1 MB.

Update:

I found that if the path in sc.addFile(path) is on HDFS, no error appears. However, when the path is on the local filesystem, the error appears.

My questions

1. Is there any cause other than the file size that could throw the above exception? (Spark is running on YARN and I am using the default shuffle service, not the external shuffle service.)
2. If I don't add the file at submit time, how can I read the model file from HDFS using PySpark (so that I can restore the model using the caffe API)? Or is there a way to get a different path from SparkFiles.get()?
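To make the second question concrete, one direction I can imagine is copying the model from HDFS to the worker's local disk myself and then handing that local path to caffe. The sketch below is only an assumption, not a confirmed solution: `fetch_from_hdfs` and its `runner` parameter are hypothetical names, and it assumes the `hdfs` command-line client is installed and on the PATH of every executor node.

```python
import os
import subprocess
import tempfile


def fetch_from_hdfs(hdfs_path, local_dir=None, runner=subprocess.check_call):
    """Copy hdfs_path to a local directory and return the local file path.

    Assumption: the `hdfs` CLI is available on the node running this code.
    `runner` is a hypothetical hook so the command can be inspected in tests.
    """
    if local_dir is None:
        local_dir = tempfile.mkdtemp(prefix="caffe-model-")
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    # Skip the copy if an earlier task on this node already fetched the file.
    if not os.path.exists(local_path):
        runner(["hdfs", "dfs", "-get", hdfs_path, local_path])
    return local_path
```

Inside a task, the returned path could then be passed to `caffe.Net(model_define, local_path, caffe.TEST)` in place of the value from `SparkFiles.get()`.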

Any suggestions would be appreciated!

apache-spark pyspark

steve 24 Mar 17 at 19:22

No one has answered this question yet
