Installation: Apache Spark

Requirements

  • A server running Rocky Linux (master) and a server running FreeBSD (slave)
  • Knowledge of the command-line and text editors
  • Basic knowledge about installing and configuring network services

Introduction

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.1

Spark is often compared to Apache Hadoop, and specifically to MapReduce, Hadoop’s native data-processing component. The chief difference between Spark and MapReduce is that Spark processes and keeps the data in memory for subsequent steps—without writing to or reading from disk—which results in dramatically faster processing speeds. 2

This guide will demonstrate the requirements and steps needed to install Apache Spark with a master node on Rocky Linux and a slave node on FreeBSD. Based on the firm's needs, multiple configuration steps are included in this document. Multiple external sources were used to develop this guide. Note that not all networking steps are outlined here; topics such as port forwarding, reverse proxies, DNS, and advanced configurations are not discussed in this document.

Installation

Apache Spark - Master

Download the required Apache Spark version and extract it.

cd ~
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xvf spark-3.3.1-bin-hadoop3.tgz
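
Optionally, verify the archive before extracting it. This sketch assumes the matching '.sha512' checksum file is published alongside the tarball on the same mirror; compare the two hashes manually.

wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz.sha512
sha512sum spark-3.3.1-bin-hadoop3.tgz
cat spark-3.3.1-bin-hadoop3.tgz.sha512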

Move the files to a new directory.

mv spark-3.3.1-bin-hadoop3 /opt/spark

Add the spark binaries to the path. The PYTHONPATH entry should point at the py4j archive shipped under /opt/spark/python/lib (adjust the file name if the bundled version differs), and it references SPARK_HOME, which is defined in the Environment section below.

export PATH=$PATH:/opt/spark/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
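
These exports only last for the current shell session. A minimal way to persist them, assuming the default bash shell on Rocky Linux, is to append them to '~/.bashrc' and reload it; repeat for the other variables as needed.

echo 'export PATH=$PATH:/opt/spark/bin' >> ~/.bashrc
source ~/.bashrc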

Apache Spark - Slave

Download the required Apache Spark version and extract it.

cd ~
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xvf spark-3.3.1-bin-hadoop3.tgz

Move the files to a new directory.

mv spark-3.3.1-bin-hadoop3 /usr/local/spark

Add the spark binary to the path by opening '~/.cshrc' with ee (the slave uses csh, hence '~/.cshrc') and appending the Spark bin directory to the shell's path variable, for example:

ee ~/.cshrc
set path = ($path /usr/local/spark/bin)
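
After saving the file, re-source it and rebuild the shell's command hash so the new binaries are found. The commands below assume an interactive csh session on the slave.

source ~/.cshrc
rehash
which spark-submit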

Scala - Master

Install the scala package. The binary is automatically added to the path. 3

curl -fL https://github.com/coursier/launchers/raw/master/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup
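
A quick check that the toolchain is now reachable; the exact version printed depends on what coursier installs.

scala -version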

Scala - Slave

Install the scala package.

pkg install scala
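
Before editing the path, the package's install location can be confirmed; the directory used in the next step is assumed to match what the package actually installs.

pkg info -l scala | grep bin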

Add the scala binary to the path by appending its bin directory to the path variable in '~/.cshrc', as before:

ee ~/.cshrc
set path = ($path /usr/local/scala/bin)

Configuration

Configuration - Master

Navigate to the 'conf' folder inside the spark directory.

cd /opt/spark/conf

Create copies of the default templates for us to modify.

sudo cp /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf
sudo cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
sudo cp /opt/spark/conf/workers.template /opt/spark/conf/workers

Uncomment the SPARK_MASTER_HOST line in the 'spark-env.sh' file and set it to the master's address.

export SPARK_MASTER_HOST='XXX.XXX.XXX.XXX'
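
Other settings can be declared in the same 'spark-env.sh' file if the defaults need to change, for example the standalone master's service port and web UI port, shown here with their default values:

export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080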

For the purposes of the firm, embedding is allowed. To allow embedding of the Spark UI, append the following line to the end of the 'spark-defaults.conf' file.

spark.ui.allowFramingFrom http://localhost
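
List each slave in the 'workers' file copied earlier so the cluster scripts know which machines to contact; the template contains only 'localhost', and each address goes on its own line. The address below is a placeholder.

sudo vi /opt/spark/conf/workers
XXX.XXX.XXX.XXX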

Configuration - Slave

Ensure that the master can connect to the slave through ssh. This sub-heading does not go into detail on the specific rules for ipfw allowances. Enable and then start the ssh service.

service sshd enable
service sshd start
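
The worker scripts log in to each slave over SSH, so key-based authentication from the master avoids repeated password prompts. A minimal sketch, run on the master; the user name and address are placeholders.

ssh-keygen -t rsa -b 4096
ssh-copy-id user@XXX.XXX.XXX.XXX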

Enable the firewall.

ipfw enable firewall

Environment

Environment - Master

The firm maintains the Spark master binary in a separate directory, environment, and operating system from the slaves. Without an extended environment configuration, the Spark master will not be able to communicate with the slaves properly.

Create the SPARK_HOME environment variable.

export SPARK_HOME=/opt/spark

Create the SPARK_SLAVE environment variable.

export SPARK_SLAVE=/usr/local/spark
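
As with the earlier PATH changes, these variables only survive the current session. Appending them to '~/.bashrc' (assuming bash on the master) keeps them available for later runs.

echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export SPARK_SLAVE=/usr/local/spark' >> ~/.bashrc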

Modify the 'start-workers.sh' and 'stop-workers.sh' scripts to direct the service to the proper directory when it SSHes onto the slaves. There are several locations in the files where SPARK_HOME must be changed to the newly created SPARK_SLAVE environment variable.

Modify the 'start-workers.sh' script.

sudo vi /opt/spark/sbin/start-workers.sh

if [ -z "${SPARK_SLAVE}" ]; then
  export SPARK_SLAVE="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Find the port number for the master
if [ "$SPARK_MASTER_PORT" = "" ]; then
  SPARK_MASTER_PORT=7077
fi

if [ "$SPARK_MASTER_HOST" = "" ]; then
  case `uname` in
      (SunOS)
          SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
          ;;
      (*)
          SPARK_MASTER_HOST="`hostname -f`"
          ;;
  esac
fi

# Launch the workers
"${SPARK_HOME}/sbin/workers.sh" cd "${SPARK_SLAVE}" \; "${SPARK_SLAVE}/sbin/start-worker.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"

Modify the 'stop-workers.sh' script.

sudo vi /opt/spark/sbin/stop-workers.sh

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

"${SPARK_HOME}/sbin/workers.sh" cd "${SPARK_SLAVE}" \; "${SPARK_SLAVE}/sbin"/stop-worker.sh

Initialization

Once the configuration is complete, run the 'start-workers.sh' script. It will prompt for a login on each slave; log in to each one.

/opt/spark/sbin/start-workers.sh
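
A quick way to confirm the daemons are up is to list the running Java processes on each machine; the master should report a 'Master' process and each slave a 'Worker' process. This assumes the JDK's jps utility is installed.

jps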

To stop the cluster, run the 'stop-workers.sh' script.

/opt/spark/sbin/stop-workers.sh

Conclusion

Apache Spark is an excellent tool for completing complex calculations on a faster timetable. Its simplicity enables researchers to connect and run jobs seamlessly, while IT managers can manage jobs through the web UI installed with Apache Spark. The master's web UI is served over HTTP on port 8080 by default (http://<host-of-spark-master>:8080), while spark://<host-of-spark-master>:7077 is the master URL that workers and applications connect to.

It is possible to integrate Apache Spark with JupyterLab as well. With this integration, researchers can view and manage their own jobs. To connect to a cluster, simply run the code below in a notebook, on the same kernel, before running the main program.

If you have set up a PySpark cluster, use the code below to connect to and disconnect from the cluster.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
sc = pyspark.SparkContext(master='spark://XXX.XXX.XXX.XXX:7077', appName='test')

To stop the job, run the code below.

sc.stop()

  1. Apache Spark
  2. Apache Spark Installation