You need to have Java and Homebrew installed before starting this tutorial.
brew install scala
Install Apache Spark
Download Spark from apache.org.
tar -xzvf ~/Downloads/spark-2.0.2-bin-hadoop2.7.tgz -C /usr/local/bin
mv /usr/local/bin/spark-2.0.2-bin-hadoop2.7 /usr/local/bin/spark
Set some environmental variables in your shell rc
export SCALA_HOME="/usr/local/bin/scala"
export SPARK_HOME="/usr/local/bin/spark"
export PATH="$PATH:$SCALA_HOME/bin"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH"
Create an alias for pyspark in your shell rc
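As one possibility, assuming the SPARK_HOME location set above, the alias could look like this:

```shell
# Point the pyspark alias at the Spark install from this tutorial.
alias pyspark="/usr/local/bin/spark/bin/pyspark"
```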
Configure a local cluster
I used the instructions found on Pulasthi Wickramasinghe’s blog to guide my configuration. Specifically, I used these instructions to edit conf/spark-env.sh:

There is a set of variables you can set to override the default values. This is done by putting values in the spark-env.sh file. A template is available at conf/spark-env.sh.template; you can use it to create the spark-env.sh file. The variables that can be set are described in the template itself. We will add the following lines to the file.
export SCALA_HOME=/home/pulasthi/work/spark/scala-2.9.3
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_WORKER_DIR=/home/pulasthi/work/sparkdata
Here SPARK_WORKER_MEMORY specifies the amount of memory to allocate to a worker node; if this value is not given, the default is the total memory available minus 1GB. Since we are running everything on our local machine, we don’t want the slaves to use up all our memory. I am running on a machine with 8GB of RAM, and since we are creating two slave nodes, we will give each of them 2GB.

SPARK_EXECUTOR_INSTANCES specifies the number of instances; here it is set to 2 since we will create only two slave nodes.

SPARK_WORKER_DIR is the location where applications will run, and it will hold both logs and scratch space. Make sure the directory can be written to by the application, that is, that its permissions are set properly.
SCALA_HOME and SPARK_WORKER_DIR should be set to the actual locations on your local machine. Once this configuration is in place, we can start the cluster up.
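For example, adapting the values above to the install locations used in this tutorial (the worker directory is just an example; any directory your user can write to works), spark-env.sh ends up looking roughly like:

```shell
# /usr/local/bin/spark/conf/spark-env.sh
# Paths match this tutorial's install locations; SPARK_WORKER_DIR is an
# example -- substitute any writable directory on your machine.
export SCALA_HOME=/usr/local/bin/scala
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_WORKER_DIR=$HOME/sparkdata
```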
Launch the cluster
Launching doesn’t get any easier.
/usr/local/bin/spark/sbin/start-master.sh
/usr/local/bin/spark/sbin/start-slaves.sh localhost
Then, read the log output at the location printed by the start-master.sh script; it will notify you of the address and port of your master node. In our case, localhost:8080 should take us to the Spark master’s web UI, where we can see that the master is running and attached to two slaves, and localhost:8081 should take us to the web UI of the first worker.
Configure PyCharm (optional)
If you’re using PyCharm to code, you might want to benefit from linting that recognizes the Spark libraries, so here’s how to configure the PyCharm interpreter:
- Open PyCharm preferences
- Navigate to Project Interpreter menu (just search for it)
- Click on the arrow to expand the pulldown on the Project Interpreter field
- Select show all
- Highlight your project python interpreter by clicking on it
- Click the interpreter paths icon
Finally, add the following paths to your interpreter paths (the same ones we added to PYTHONPATH above):

/usr/local/bin/spark/python
/usr/local/bin/spark/python/lib/py4j-0.10.3-src.zip
You are now able to import pyspark in your Python files. You can also run Spark interactively by typing pyspark in the terminal.