Installation: Apache Spark
Requirements
- A server running Rocky Linux (master) and a server running FreeBSD (slave)
- Knowledge of the command-line and text editors
- Basic knowledge about installing and configuring network services
Introduction
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.1
Spark is often compared to Apache Hadoop, and specifically to MapReduce, Hadoop’s native data-processing component. The chief difference between Spark and MapReduce is that Spark processes and keeps the data in memory for subsequent steps—without writing to or reading from disk—which results in dramatically faster processing speeds. 2
This guide demonstrates the requirements and steps needed to install Apache Spark with a master node on Rocky Linux and a slave node on FreeBSD. Based on the firm's needs, multiple configuration steps are included in this document. Multiple external sources were used to develop this guide. Note that not every networking step is outlined here; topics such as port-forwarding, reverse proxies, DNS, and advanced configurations are not discussed in this document.
Installation
Apache Spark - Master
Download the required Apache Spark version and extract it.
cd ~
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xvf spark-3.3.1-bin-hadoop3.tgz
Move the files to a new directory.
mv spark-3.3.1-bin-hadoop3 /opt/spark
Add the spark binary to the path.
export PATH=$PATH:/opt/spark/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
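These exports only apply to the current shell session, and the $SPARK_HOME variable they reference is defined later in the Environment section. Assuming a Java runtime is already present on the master, a quick check that the new PATH entry resolves:
# confirm the Spark launcher is found on the updated PATH
spark-submit --version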
Apache Spark - Slave
Download the required Apache Spark version and extract it.
cd ~
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xvf spark-3.3.1-bin-hadoop3.tgz
Move the files to a new directory.
mv spark-3.3.1-bin-hadoop3 /usr/local/spark
Add the spark binary to the path.
ee ~/.cshrc
Inside '~/.cshrc', append '/usr/local/spark/bin' to the path, for example:
set path = ($path /usr/local/spark/bin)
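New login shells pick up the change automatically; to apply it to the current csh/tcsh session and confirm the binary resolves:
# reload the startup file and check the Spark launcher is on the path
source ~/.cshrc
which spark-submit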
Scala - Master
Install the scala package. The binary is automatically added to the path. 3
curl -fL https://github.com/coursier/launchers/raw/master/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup
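The coursier setup adjusts the shell profile, so the commands may only resolve in a new shell; a quick check that the toolchain installed:
# confirm the Scala toolchain is on the PATH
scala -version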
Scala - Slave
Install the scala package.
pkg install scala
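Spark and Scala both run on the JVM, so the slave also needs a Java runtime if one is not already present; the package below is one commonly used option, not a requirement of this guide:
# install a Java runtime (Spark 3.3 supports Java 8, 11, and 17)
pkg install openjdk11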
Add the scala binary to the path.
ee ~/.cshrc
Inside '~/.cshrc', append '/usr/local/scala/bin' to the path, for example:
set path = ($path /usr/local/scala/bin)
Configuration
Configuration - Master
Navigate to the spark directory and enter into the 'conf' folder.
cd /opt/spark/conf
Create copies of the default templates for us to modify.
sudo cp /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf
sudo cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
sudo cp /opt/spark/conf/workers.template /opt/spark/conf/workers
Uncomment and modify the 'spark-env.sh' file to set SPARK_MASTER_HOST to the address of the master node.
export SPARK_MASTER_HOST='XXX.XXX.XXX.XXX'
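The same file accepts other master settings. The values below are the Spark defaults and only need to be set explicitly if different ports are required; they are shown here as an optional example:
# optional: pin the ports the master listens on (these are the defaults)
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080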
For the purposes of the firm, embedding is allowed. To allow embedding of the Spark UI, append the below code to the end of the 'spark-defaults.conf' file.
spark.ui.allowFramingFrom http://localhost
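Optionally, and this is an assumption about how jobs will be submitted rather than a required step, 'spark-defaults.conf' can also carry a default master URL so that spark-submit does not need a --master flag each time. The address below is a placeholder:
spark.master spark://XXX.XXX.XXX.XXX:7077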
Configuration - Slave
Ensure that the master can connect to the slave through SSH. This sub-heading does not go into detail on the specific rules for ipfw allowances. Enable and then start the ssh service.
service sshd enable
service sshd start
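Optionally, key-based SSH from the master to the slave avoids password prompts when the worker scripts run. A minimal sketch, executed on the master, where the user name and slave address are placeholders:
# on the master: create a key pair (if one does not exist) and copy it to the slave
ssh-keygen -t ed25519
ssh-copy-id user@XXX.XXX.XXX.XXX
# confirm password-less login works
ssh user@XXX.XXX.XXX.XXX exit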
Enable the firewall.
service ipfw enable
Environment
Environment - Master
The firm maintains the spark master binary in a separate folder location, environment, and OS. Without an extended environment configuration, the spark master will not be able to communicate with the slaves properly.
Create the SPARK_HOME environment variable.
export SPARK_HOME=/opt/spark
Create the SPARK_SLAVE environment variable.
export SPARK_SLAVE=/usr/local/spark
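These exports only last for the current shell session. One way to make them persistent, assuming bash is the login shell on the master, is to append them to '~/.bashrc':
# persist the variables so the sbin scripts see them in future sessions
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export SPARK_SLAVE=/usr/local/spark' >> ~/.bashrc
source ~/.bashrc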
Modify the 'start-workers.sh' and 'stop-workers.sh' scripts to direct the service to the proper directory when it connects to the slaves over SSH. There are several locations in the files where SPARK_HOME must be changed to the newly created SPARK_SLAVE environment variable.
Modify the 'start-workers.sh' script.
sudo vi /opt/spark/sbin/start-workers.sh

if [ -z "${SPARK_SLAVE}" ]; then
  export SPARK_SLAVE="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Find the port number for the master
if [ "$SPARK_MASTER_PORT" = "" ]; then
  SPARK_MASTER_PORT=7077
fi

if [ "$SPARK_MASTER_HOST" = "" ]; then
  case `uname` in
      (SunOS)
          SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
          ;;
      (*)
          SPARK_MASTER_HOST="`hostname -f`"
          ;;
  esac
fi

# Launch the workers
"${SPARK_HOME}/sbin/workers.sh" cd "${SPARK_SLAVE}" \; "${SPARK_SLAVE}/sbin/start-worker.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
Modify the 'stop-workers.sh' script.
sudo vi /opt/spark/sbin/stop-workers.sh

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

"${SPARK_HOME}/sbin/workers.sh" cd "${SPARK_SLAVE}" \; "${SPARK_SLAVE}/sbin"/stop-worker.sh
Initialization
Once the configuration is complete, run the 'start-workers.sh' script. It will prompt the user to log in to each slave instance; log in to each one.
/opt/spark/sbin/start-workers.sh
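To confirm the slave registered, the master's web UI, which listens on port 8080 by default, should list the worker. A quick check from the master:
# the worker should appear under 'Workers' in the master web UI
curl -s http://localhost:8080 | grep -i worker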
To stop the cluster, run the 'stop-workers.sh' script.
/opt/spark/sbin/stop-workers.sh
Conclusion
Apache Spark is an excellent tool for completing complex calculations on a faster timetable. The simplicity of the tool enables researchers to seamlessly connect and run jobs, while IT managers can manage jobs through the UI installed with Apache Spark. To see the Spark UI, users should visit the master's web UI (port 8080 by default) at http://<host-of-spark-master>:8080; the cluster itself is addressed as spark://<host-of-spark-master>:7077.
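As a quick smoke test, a job can also be submitted to the cluster from the command line. The script below is the pi example that ships inside the Spark distribution, and the master address is a placeholder:
# submit the bundled pi example to the cluster (replace the master address)
/opt/spark/bin/spark-submit --master spark://XXX.XXX.XXX.XXX:7077 /opt/spark/examples/src/main/python/pi.py 10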
It is possible to integrate Apache Spark with JupyterLab as well. By integrating this tool, researchers can also view and manage their jobs. To connect to a cluster, simply run the following code in a notebook prior to running the main program on the same kernel.
If you have set up a PySpark cluster, use the code below to connect to and disconnect from it.
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

sc = pyspark.SparkContext(master='spark://XXX.XXX.XXX.XXX:7077', appName='test')
To stop the job run the below code.
sc.stop()