In this blog post, we set up Apache Spark and Apache Airflow in Docker containers and, in the end, run and schedule Spark jobs from the Dockerized Airflow deployment. This matters because Docker images solve many of the problems we encounter in development, such as mismatched environments and dependency conflicts, which in turn speeds up development and deployment to production.
Some of the advantages of using Docker containers to run Spark jobs include consistent environments across machines, isolated dependency management, and fast, reproducible deployments.
Interestingly, different companies and individuals have published Spark images on Docker Hub; you can pull these images and try them out to see whether they address your development use case. In our case, I tried the images provided by Bitnami and Data Mechanics. Although these two have the highest download counts on Docker Hub, there are many other options; I have not been able to test and explore all of them personally, but both the Data Mechanics and the Bitnami Spark images work for the ETL workflow provided here. We are not going to look at the Airflow Docker images in this blog post, but there is comprehensive documentation on how to set up Airflow on Docker here.
For this blog post, we will use the image provided by Bitnami because it has more features and is easy to use. You can extend the Spark and Airflow Docker images with Dockerfiles by including dependencies and pinning the versions of Java, Spark, and Scala that your external systems require. You can package your deployment suite, build a Spark image from it, and run your applications in containers.
In our case, I was working to schedule Spark jobs with Airflow running in a Docker container. One of the problems we encountered was that Airflow could not find the Spark dependencies it needs to run and submit Spark jobs to the cluster. Java is also required on the Airflow instance before it can run and schedule Spark jobs successfully.
In the GitHub repository, we have provided Dockerfiles for Spark and Airflow; these files let you extend and customize the Docker images for your own use cases. Interestingly, you can also run Airflow and Spark locally, but we are not going to show that in this blog post. There is another blog post about it here, where we show how to run Spark and Airflow directly on your machine.
In this article, we are going to extract data from a RESTful API, transform the JSON data into a DataFrame, and load the data into an S3 bucket; then we will go a step further and schedule and run the ETL process on Airflow. The process goes as follows:
git clone https://github.com/yTek01/docker-spark-airflow.git
After cloning the GitHub repository, you can extend the Spark container to include all the dependencies you need for your ETL process. In our case, we have included the AWS Java SDK and the Hadoop AWS library, which are needed for pushing data to the S3 bucket. Please go to the repo to access all the code used in this blog post.
FROM bitnami/spark:3
USER root

RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.231/aws-java-sdk-bundle-1.12.231.jar --output /opt/bitnami/spark/jars/aws-java-sdk-bundle-1.12.231.jar
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar --output /opt/bitnami/spark/jars/hadoop-aws-3.3.1.jar
RUN curl https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar --output /opt/bitnami/spark/jars/jets3t-0.9.4.jar

COPY ./dags/spark_etl_script_docker.py /opt/bitnami/spark
COPY ./requirements_copy.txt /
RUN pip install -r /requirements_copy.txt
docker build -f Dockerfile.Spark . -t spark-air
With that done, we will build the Airflow image, including all the necessary dependencies such as Java, the Python libraries needed to run Spark jobs, and the other dependencies Airflow needs to run smoothly.
FROM apache/airflow:2.2.5
USER root

# Install OpenJDK-11
RUN apt update && \
    apt-get install -y openjdk-11-jdk && \
    apt-get install -y ant && \
    apt-get clean

# Set JAVA_HOME
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64/
RUN export JAVA_HOME

USER airflow
COPY ./requirements.txt /
RUN pip install -r /requirements.txt
COPY --chown=airflow:root ./dags /opt/airflow/dags
docker build -f Dockerfile.Airflow . -t airflow-spark
With all the images ready, we can now connect Airflow and Spark. Run the commands below to create the environment variables and the dags, logs, and plugins folders. Make sure to set the image tag names you built in the previous section in the docker-compose.yaml files for Airflow and Spark.
mkdir ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
AIRFLOW_UID=33333
AIRFLOW_GID=0
AWS_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXX
AWS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXX
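Before bringing the stack up, make sure each compose file points at the images you built earlier. The fragment below is a hypothetical sketch of what the Spark side of docker-compose.Spark.yaml might look like; the service names and environment variables are illustrative, so keep the ones from the repository.

```yaml
# Hypothetical fragment of docker-compose.Spark.yaml; only the image
# tags are the point here — they must match the tags you built above.
services:
  spark:
    image: spark-air        # built with: docker build -f Dockerfile.Spark . -t spark-air
    environment:
      - SPARK_MODE=master
  spark-worker:
    image: spark-air
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
```

The Airflow compose file gets the same treatment, with its services pointing at the airflow-spark tag.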
docker-compose -f docker-compose.Spark.yaml -f docker-compose.Airflow.yaml up -d
If something fails at this point, please go back and resolve the issue before proceeding to Airflow.
docker exec -it spark-jobs-spark-worker-1 \
  spark-submit --master spark://XXXXXXXXXXXXX:7077 \
  spark_etl_script_docker.py
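The spark_etl_script_docker.py submitted above implements the extract, transform, and load flow described earlier. As a rough illustration of the transform step, here is how nested JSON records from the API could be flattened into rows before building a Spark DataFrame; the field names below are hypothetical and not taken from the repository.

```python
# Sketch of the transform step: flattening nested JSON records from a
# hypothetical REST API into flat rows suitable for a Spark DataFrame.
# The field names ("user", "name", "city") are illustrative only.

def flatten_record(record: dict) -> dict:
    """Flatten one level of nesting into a single-level row."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

sample = {"id": 1, "user": {"name": "Ada", "city": "Lagos"}}
row = flatten_record(sample)
# row == {"id": 1, "user_name": "Ada", "user_city": "Lagos"}
```

In the actual job, a list of such rows would be passed to spark.createDataFrame(...) and written out with df.write to an s3a:// path.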
After testing that the deployment works as intended, we can move on to scheduling our Spark job on Airflow. Importantly, the Spark connection on the Airflow UI should look like the following.
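With that connection in place, a DAG along these lines can schedule the job. This is a hedged sketch rather than the repository's exact DAG: it assumes the apache-airflow-providers-apache-spark package is installed, that the connection is registered under the conn_id "spark_default", and the dag_id and schedule are illustrative. It needs a running Airflow deployment, so treat it as a configuration sketch.

```python
# Hypothetical DAG that schedules the Spark ETL job via SparkSubmitOperator.
# Assumes apache-airflow-providers-apache-spark is installed and a Spark
# connection named "spark_default" exists in the Airflow UI.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_etl_docker",          # illustrative DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_etl = SparkSubmitOperator(
        task_id="submit_spark_etl",
        conn_id="spark_default",
        application="/opt/bitnami/spark/spark_etl_script_docker.py",
    )
```

The application path matches where the Spark Dockerfile above copies the script.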
In general, the problems we encountered were dependency issues of various kinds. One is that the Python versions used by the Airflow and Spark drivers must match for the two to work together properly; I ran into this during development, and the solution was to upgrade the Python version running in Airflow to match Spark's. Another is that Java must be available in the Airflow Docker container; in our case, we installed Java 11. To send data to the S3 bucket directly from Airflow, the Java version on your Airflow instance must also meet the requirements of the Py4J Python library.
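The driver/worker Python mismatch can be caught early with a simple check. The version strings below are illustrative; in practice you would read them from `python --version` inside the Airflow and Spark containers.

```python
# Minimal guard against the Python version mismatch described above.
# PySpark requires the same major.minor Python on the driver and workers;
# patch levels may differ.

def versions_compatible(driver: str, worker: str) -> bool:
    """Return True when two Python version strings share major.minor."""
    d_major, d_minor = driver.split(".")[:2]
    w_major, w_minor = worker.split(".")[:2]
    return (d_major, d_minor) == (w_major, w_minor)

assert versions_compatible("3.9.13", "3.9.7")        # patch levels may differ
assert not versions_compatible("3.7.12", "3.9.7")    # major.minor must match
```

Running a check like this at the top of the Spark script turns a cryptic Py4J failure into an immediate, readable error.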
Another problem you might run into concerns the Spark dependencies required to send data to an S3 bucket or to work with other AWS resources; make sure all of these dependencies are included in the jars folder of $SPARK_HOME. Also keep in mind that the recommended way of sending data to or retrieving data from an S3 bucket is through an S3 access point, so make sure you create the resources correctly; you can check this blog post that shows how to create an S3 access point the right way. With all this in place, you can set up your own Spark and Airflow deployment on Docker containers and let Airflow schedule and run your Spark jobs. You can access all the code used in this blog post on GitHub.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!