Run pyspark locally

I tried to follow the instructions in this book:

Large Scale Machine Learning with Python

It uses a virtual machine image to run Spark via Oracle VirtualBox and Vagrant. I almost managed to get the VM to work, but I'm blocked by not having permission to enable virtualization in the BIOS (I don't have the password, and I doubt my employer's IT department will enable it for me). See also the discussion here.

Anyway, what other options do I have for playing with Spark locally (i.e. installing it locally)? My first goal is to get this Scala code:

scala> val file = sc.textFile("C:\\war_and_peace.txt")
scala> val warsCount = file.filter(line => line.contains("war"))
scala> val peaceCount = file.filter(line => line.contains("peace"))
scala> warsCount.count()
res0: Long = 1218
scala> peaceCount.count()
res1: Long = 128


works in Python. Any pointers would be much appreciated.

1 answer


You can set up the Spark Python and Scala shells on Windows, but the caveat is that, in my experience, performance on Windows is worse than on OS X and Linux. If you want to go down the path of setting everything up on Windows, I wrote up brief instructions for this not too long ago that you can check here. I'm pasting the text below in case I ever move the file out of that repo or the link breaks for some other reason.

Downloading and extracting Spark

Download the latest Spark release from Apache. Keep in mind that it is very important to get the correct Hadoop binaries for the Spark version you choose; see the section below. Extract the archive with 7-Zip.
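A .tgz archive takes two passes with 7-Zip (the first pass ungzips to a .tar, the second untars it). The file name below is just a placeholder for whichever release you actually downloaded:

    7z x spark-2.1.0-bin-hadoop2.7.tgz
    7z x spark-2.1.0-bin-hadoop2.7.tar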

Installing Java and Python

Install the latest 64-bit Java. Install Anaconda3 Python 3.5 64-bit (or whatever version you prefer) for all users. Restart the machine.

Java and Python test

Open a command prompt and enter java -version. If Java is installed correctly, you will see output like this:

    java version "1.8.0_121"
    Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Then enter either python or python --version. The first opens the Python shell after displaying the version information; the second only shows version information, like this:

    Python 3.5.2 :: Anaconda 4.2.0 (64-bit)

Download Hadoop binary for Windows 64-bit

You probably don't have Hadoop installed on Windows, but deep in its core Spark looks for these Hadoop binaries (winutils.exe among them). Fortunately, a Hadoop contributor has compiled them and maintains a repository of binaries for Hadoop 2.6. These binaries will work for Spark version 2.0.2, but will not work for 2.1.0. To use Spark 2.1.0, download the binaries from here.



The best tactic is to clone that repo, keep the Hadoop folder corresponding to your Spark version, and point the HADOOP_HOME environment variable at that hadoop-%version% folder.
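For example, if the binaries ended up in C:\hadoop\hadoop-2.6.0 (this path is illustrative, not from the original instructions), you could set the variable for the current session or persist it with setx:

    set HADOOP_HOME=C:\hadoop\hadoop-2.6.0
    setx HADOOP_HOME "C:\hadoop\hadoop-2.6.0"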

Add Java and Spark to Environment

Add the Java and Spark installation paths as the environment variables JAVA_HOME and SPARK_HOME, respectively.
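A minimal sketch, assuming typical install locations (substitute wherever Java and your extracted Spark folder actually live):

    set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_121
    set SPARK_HOME=C:\spark\spark-2.1.0-bin-hadoop2.7
    set PATH=%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin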

Test pyspark

At the command prompt, enter pyspark and watch the output. At this point Spark should start up and drop you into a Python shell.
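Once the shell is up, the Scala snippet from the question translates almost line for line. This is a sketch that assumes the same war_and_peace.txt path; sc is already created for you by the pyspark shell:

    >>> file = sc.textFile("C:\\war_and_peace.txt")
    >>> wars_count = file.filter(lambda line: "war" in line)
    >>> peace_count = file.filter(lambda line: "peace" in line)
    >>> wars_count.count()
    >>> peace_count.count()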

Setting up pyspark to use the Jupyter notebook

Instructions for using interactive Python shells with pyspark exist in the pyspark code itself and can be read in your editor. To use a Jupyter notebook instead, run the following two commands before running pyspark:

    set PYSPARK_DRIVER_PYTHON=jupyter
    set PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Once these variables are set, pyspark will run in a Jupyter notebook with the default SparkContext initialized as sc and the SparkSession initialized as spark. ProTip: open http://127.0.0.1:4040 to view the Spark UI, which contains a lot of useful information about your pipeline and completed jobs. Any additional notebooks you open while Spark is running will be served on sequential ports, i.e. 4041, 4042, etc.
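As a quick sanity check that the notebook is wired up to Spark, you can run something like this in the first cell (purely illustrative):

    print(sc.version)     # version reported by the pre-initialized SparkContext
    print(spark.version)  # same, via the SparkSession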

The bottom line is that getting the correct version of the Hadoop binaries for your version of Spark is critical. Beyond that, make sure your path and environment variables are set correctly.
