Create and display a DataFrame from a simple JSON file

The following simple json DataFrame test works fine when running Spark in local mode. Here is a Scala snippet, but I successfully got the same thing in Java and Python:

sparkContext.addFile(jsonPath)

val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
val dataFrame = sqlContext.jsonFile(jsonPath)
dataFrame.show()

      

I made sure the jsonPath works from both the driver and the worker side, and I do call addFile ... The JSON file is very trivial:

[{"age":21,"name":"abc"},{"age":30,"name":"def"},{"age":45,"name":"ghi"}]

      

The exact same code crashes when I turn off local mode and use a separate Spark server with one master / worker. I've tried this same test in Scala, Java and Python to try and find some combination that works. They all get the same error. The following error is from a Scala driver program, but the Java / Python error messages are almost identical:

15/04/17 18:05:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
    at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
    at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)

      

This is very frustrating. I'm basically just trying to get the code snippets from the official docs to work.

UPDATE: Thanks Paul for the insightful answer. I am still getting errors when I follow the same steps. FYI, I was using a driver program before, hence the name sparkContext rather than the shell's default sc. The following is an abbreviated snippet of the (rather repetitive) log output:

➜  spark-1.3.0  ./bin/spark-shell --master spark://172.28.128.3:7077
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val dataFrame = sqlContext.jsonFile("/private/var/userspark/test.json")
15/04/20 18:01:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
    at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
    at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
    at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
    (...)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.0.2.15): java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)

      



1 answer


While I can get your simple example to work, I agree that Spark can be frustrating ...

Here I have Spark 1.3.0 built from source with OpenJDK 8.

Using your file with spark-shell or spark-submit fails for a variety of reasons, where perhaps the examples / docs are outdated compared to the released code and need to be slightly tweaked.

For example, in the spark-shell the SparkContext is already available as sc, not as sparkContext, and there is a similarly predefined sqlContext. The Spark shell emits INFO messages announcing the creation of these contexts.
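In a standalone driver program (as in the question), those contexts have to be created explicitly. Here is a minimal sketch of what that looks like on Spark 1.3, reusing the master URL and /data/so1.json path from this answer; the object and app names are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch for Spark 1.3; object name, app name, and paths are placeholders.
object JsonShowApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("json-show")
      .setMaster("spark://192.168.1.10:7077")
    val sparkContext = new SparkContext(conf)
    val sqlContext = new SQLContext(sparkContext)

    // Spark 1.3 API: load the JSON file into a DataFrame and print it.
    val dataFrame = sqlContext.jsonFile("/data/so1.json")
    dataFrame.show()

    sparkContext.stop()
  }
}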

For spark-submit, I am getting some kind of jar error. This may be a local issue.

Anyway, it works great if I shorten it. It also doesn't matter, for the purpose of running this short example, whether the JSON file has one object per line or not. For a future test it would be helpful to create a larger example and determine whether it runs in parallel across cores and whether it requires one object per line (no commas or enclosing brackets); see the sketch further below.

so1-works.sc

val dataFrame = sqlContext.jsonFile("/data/so1.json")
dataFrame.show()

      

The output, with INFO messages suppressed, etc ...

paul@ki6cq:~/spark/spark-1.3.0$ ./bin/spark-shell --master spark://192.168.1.10:7077 <./so1-works.sc 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.8.0_40-internal)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> 

scala> val dataFrame = sqlContext.jsonFile("/data/so1.json")
dataFrame: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> dataFrame.show()
age name
21  abc 
30  def 
45  ghi 

scala> Stopping spark context.
paul@ki6cq:~/spark/spark-1.3.0$ paul@ki6cq:~/spark/spark-1.3.0$ 

      

Odd note: after this I need to execute reset to get my Linux terminal back to normal.
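On the one-object-per-line and parallelism questions above, a cheap way to probe them without touching the file system at all is to feed the same three records to jsonRDD from an in-memory RDD. This is just a sketch, assuming the sc and sqlContext provided by spark-shell:

// Build an RDD of JSON strings, one self-contained object per string.
val jsonLines = sc.parallelize(Seq(
  """{"age":21,"name":"abc"}""",
  """{"age":30,"name":"def"}""",
  """{"age":45,"name":"ghi"}"""))

// Spark 1.3 API: parse an RDD of JSON strings into a DataFrame.
val df = sqlContext.jsonRDD(jsonLines)
df.show()

// Rough check of how the data is split across cores.
println("partitions: " + df.rdd.partitions.length)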

So first, try shortening the example as I did.

If that doesn't fix it, you can try duplicating my environment.

That might be simple, since I am using Docker for the master and worker and have published the images on the public Docker hub.

Note to future readers: My public docker images are not official Spark images and are subject to change or removal.

You need two computers (one running Linux or another Docker-compatible OS to host the master and worker in Docker containers; the other also preferably running Linux, or at least something that can run the spark-1.3.0 build) behind a home firewall router (D-Link, Netgear, etc.). I assume the local network is 192.168.1.*, that 192.168.1.10 and .11 are free, and that the router will route correctly, or that you know how to set up the routes yourself. You can change these addresses in the run script below.

If you only have one computer, the bridged networking approach I use here will probably not let the containers communicate with the host as expected. It can be made to work, but that's a bit more than I'd like to add to an already lengthy post.



On one Linux machine, install docker, the pipework utility, and these shell scripts (adjust the memory provided to Spark as needed; adding additional workers doesn't seem to be required):

./run-docker-spark

#!/bin/bash
# Cache sudo credentials for the pipework calls below.
sudo -v
# Start the master container, telling it the LAN hostnames it should know about.
MASTER=$(docker run --name="master" -h master --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 -d drpaulbrewer/spark-master:latest)
# Give the master container a static IP on the LAN via pipework.
sudo pipework eth0 $MASTER 192.168.1.10/24@192.168.1.1
# Start one worker container, pointing it at the master, with /data and /tmp shared.
SPARK1=$(docker run --name="spark1" -h spark1 --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env mem=10G --env master=spark://192.168.1.10:7077 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK1 192.168.1.11/24@192.168.1.1

      

./stop-docker-spark

#!/bin/bash
docker kill master spark1
docker rm master spark1

      

The other Linux machine will be your user machine and needs the spark-1.3.0 build. Create a /data directory on both computers and put the JSON file there. Then run ./run-docker-spark just once, on the machine that acts as the shared host for the containers (like virtual machines) that hold the master and worker. To stop the Spark system, use the stop script. If you reboot or hit a bad error, you need to run the stop script before running the run script again.

Check that the master and worker have connected at http://192.168.1.10:8080

If so, then you are good to try the spark-shell command line shown above.
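Before involving the JSON parser at all, a quick sanity check from the shell is a plain text read of the shared file, which forces the worker to open it. Just a sketch, using the sc from spark-shell and the /data/so1.json path from above:

// If this count succeeds, the worker can at least see and read the file,
// which rules out basic path / mount problems.
val lines = sc.textFile("/data/so1.json")
println("line count: " + lines.count())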

You don't need these Dockerfiles yourself, because the builds are published on the public Docker hub and docker downloads them automatically. But here they are in case you want to see how everything is built: the JDK, the maven command, etc.

I start with a generic Dockerfile, which I put in a directory named spark-roasted-elephant, since this is a non-Hadoop build and Hadoop has an elephant on its O'Reilly book cover. You need the original spark-1.3.0 tarball from the Spark site to be in the directory with the Dockerfile. This Dockerfile probably doesn't EXPOSE enough ports (Spark is fairly messy about port usage, while Docker is unfortunately designed to restrict and document port usage), and the EXPOSE is overridden in the shell scripts that start the master and worker. It does make a bit of a mess if you ask docker to list what is running, because the listing is mostly a long list of ports.

paul@home:/Z/docker$ cat ./spark-roasted-elephant/Dockerfile
# Copyright 2015 Paul Brewer http://eaftc.com
# License: MIT
# this docker file builds a non-hadoop version of spark for standalone experimentation
# thanks to article at http://mbonaci.github.io/mbo-spark/ for tips
FROM ubuntu:15.04
MAINTAINER drpaulbrewer@eaftc.com
RUN adduser --disabled-password --home /spark spark
WORKDIR /spark
ADD spark-1.3.0.tgz /spark/ 
WORKDIR /spark/spark-1.3.0
RUN sed -e 's/archive.ubuntu.com/www.gtlib.gatech.edu\/pub/' /etc/apt/sources.list > /tmp/sources.list && mv /tmp/sources.list /etc/apt/sources.list
RUN apt-get update && apt-get --yes upgrade \
    && apt-get --yes install sed nano curl wget openjdk-8-jdk scala \
    && echo "JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >>/etc/environment \
    && export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" \
    && ./build/mvn -Phive -Phive-thriftserver -DskipTests clean package \
    && chown -R spark:spark /spark \
    && mkdir /var/run/sshd
EXPOSE 2222 4040 6066 7077 7777 8080 8081 

      

The master is built from the directory ./spark-master using a Dockerfile and a shell script that is included in the container. Here are that Dockerfile and shell script.

paul@home:/Z/docker$ cat ./spark-master/Dockerfile
FROM drpaulbrewer/spark-roasted-elephant:latest
MAINTAINER drpaulbrewer@eaftc.com
ADD my-spark-master.sh /spark/
USER spark
CMD /spark/my-spark-master.sh

paul@home:/Z/docker$ cat ./spark-master/my-spark-master.sh
#!/bin/bash -e
cd /spark/spark-1.3.0
# set SPARK_MASTER_IP to a net interface address, e.g. 192.168.1.10
export SPARK_MASTER_IP
./sbin/start-master.sh 
sleep 10000d

      

And for the worker:

paul@home:/Z/docker$ cat ./spark-worker/Dockerfile
FROM drpaulbrewer/spark-roasted-elephant:latest
MAINTAINER drpaulbrewer@eaftc.com
ADD my-spark-worker.sh /spark/
CMD /spark/my-spark-worker.sh
paul@home:/Z/docker$ cat ./spark-worker/my-spark-worker.sh
#!/bin/bash -e
cd /spark/spark-1.3.0
sleep 10
# dont use ./sbin/start-slave.sh it wont take numeric URL
mkdir -p /Z/data
mkdir -p /user/hive/warehouse
chown -R spark:spark /user
su -c "cd /spark/spark-1.3.0 && ./bin/spark-class org.apache.spark.deploy.worker.Worker --memory $mem $master" spark

      

Although by now this post has turned into an answer to the question "how can I make Dockerfiles for Spark?", that was not the intent. These Dockerfiles are experimental for me; I don't use them in production and I can't vouch for their quality. I didn't like the Spark Dockerfiles I found on the Docker hub because they followed the lazy practice of bundling a bunch of things together, and they were huge and took forever to download. Mine have significantly fewer layers and less to download. This is posted not as a Docker example, but so you can work out what is different in your environment.
