Airflow and Spark: Running Spark Jobs in Airflow (Docker-based Solution)

Airflow and Spark: Running Spark Jobs on Airflow (Docker-based Solution)

In this blog post, we set up Apache Spark and Apache Airflow using a Docker container, and in the end, we ran and scheduled Spark jobs using Airflow which is deployed on a Docker container. This is very important because, with Docker images, we are able to solve problems we encountered in development. For example, problems that relate to a different environment, dependencies issues e.t.c, thereby leading to fast development, and deployment to production.

Some of the advantages of using Docker container to run Spark jobs are;

  • You can share your development with your co-worker with all the necessary dependencies included, and set up the full application within a few minutes
  • You can build, and deploy your development on AWS, Azure, GCP or any cloud platform, and expect that you would get the same result
  • You have the ability to package multiple applications on the same machine, for example, you can run Spark, Kafka, and Airflow on the same machine, and all your deployments will work as expected. Even if the application requires different versions of dependencies to run independently. You can isolate them with Docker containers
  • You can aggregate the amount of RAM and CPU that would be needed by different running applications on your machine thereby managing resources
Spark and Airflow communicating together

Spark Images

Interestingly, different companies and individuals have published Spark images on Dockerhub, you can pull the images, and try them out to see if they address your development use case. In our case, I was able to try the images provided by Bitnami Spark and Datamechanics, although these companies have the highest downloads on the Dockerhub, there are still many different options. I have not been able to test and explore all of these images personally, but I tested the Spark image provided by Datamechanics and Bitnami Spark which works for the ETL workflow provided here. We are not going to look at the Airflow Docker images in this blog post, but there is comprehensive documentation on how to set up AIrflow on Docker images here.

For this blog post, we would be using the one provided by Bitnami Spark because it has more features and is easy to use. You can extend the Docker image of Spark and Airflow using Dockerfiles by including dependencies, and versions of Java, Spark and Scala in relation to external systems needs. You can package your deployment suite, create a Spark image from this and run your applications on containers.

In our case, I was working to schedule Spark jobs with Airflow sitting on a Docker container. One of the problems we encountered in this process was that Airflow was looking for the necessary Spark dependencies required to run, and submit Spark jobs to Airflow. Also, on the Airflow instance, Java is very much required before Airflow can run and schedule Spark jobs successfully. 

In the Github repository, we have provided Dockerfiles for Spark, and Airflow, the files will allow you to extend and design the Docker image based on what your use cases are. Interestingly, you can run Airflow and Spark locally, but we are not going to show it in this blog post. There is another blog post about that here, where we showed how to run Spark and Airflow directly on your machine.

Spark job with ETL process running on Airflow

In this article, the process is that we are going to extract data from the Restful API, transform the JSON data into a data frame, and load the data into an S3 bucket, then we will go a step further to schedule and run our ETL process on Airflow. The process goes thus;

  • Clone the Github repository
git clone https://github.com/yTek01/docker-spark-airflow.git

After cloning the Github repository, you can now extend the Spark container to include all the dependencies you would need for your ETL process. In our case, we have included the AWS Java SDK and Hadoop AWS library which are needed for pushing data to the S3 bucket. Please go to the repo to access all the blog codes.

  • Include all your dependencies as needed by the external resources using Dockerfile. In our case, the Spark Dockerfile looks like the following;
FROM bitnami/spark:3
USER root
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.231/aws-java-sdk-bundle-1.12.231.jar --output /opt/bitnami/spark/jars/aws-java-sdk-bundle-1.12.231.jar
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar --output /opt/bitnami/spark/jars/hadoop-aws-3.3.1.jar
RUN curl https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar --output /opt/bitnami/spark/jars/jets3t-0.9.4.jar

COPY ./dags/spark_etl_script_docker.py /opt/bitnami/spark

COPY ./requirements_copy.txt /
RUN pip install -r /requirements_copy.txt
  • Build the Spark image using the command below
docker build -f Dockerfile.Spark . -t spark-air

With that, we are going to build the Airflow image, and also include all the necessary dependencies like Java, Python library necessary to run Spark jobs, and other dependencies needed by Airflow to run smoothly.

  • Include all the dependencies as needed by Airflow, in our case, the Airflow Dockerfile looks like this.
FROM apache/airflow:2.2.5
USER root

# Install OpenJDK-11
RUN apt update && \
    apt-get install -y openjdk-11-jdk && \
    apt-get install -y ant && \
    apt-get clean;

# Set JAVA_HOME
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64/
RUN export JAVA_HOME

USER airflow

COPY ./requirements.txt /
RUN pip install -r /requirements.txt
COPY --chown=airflow:root ./dags /opt/airflow/dags
  • Build the Airflow image using the command below
docker build -f Dockerfile.Airflow . -t airflow-spark

With all the images ready, we can now connect Airflow and Spark together. Follow the command below to create the environment variables, create the dags folder, the logs folder, and the plugins folder. Make sure to include the images tagged name you build in the previous section into the docker-compose.yaml file for each of Airflow and Spark.

  • Run the commands below to finish the Airflow and Spark setup
mkdir ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
  • Our environment variable looks like this
AIRFLOW_UID=33333
AIRFLOW_GID=0
AWS_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXX
AWS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXX
  • Start and run the Spark and Airflow containers
docker-compose -f docker-compose.Spark.yaml -f docker-compose.Airflow.yaml up -d

When all the services all started successfully, now go to http://localhost:8080/ to check that Airflow has started successfully, and http://localhost:8090/ that Spark is up and running.

  • You can test to confirm if the Spark job is completed successfully before you integrate the Spark job with Airflow for scheduling. 

If something should fail at this point please go back to the issue and make sure you solve the problem before you proceed to Airflow. 

docker exec -it spark-jobs-spark-worker-1 \
  spark-submit --master spark://XXXXXXXXXXXXX:7077 \
  spark_etl_script_docker.py
A diagram showing Spark UI with the Spark worker and jobs information
The result of Spark job run from Docker containers
Result of completed Spark job runs from Docker container
A diagram showing Spark job result in S3 bucket

After we have tested to see that our deployment all works as intended, if all is fine with the setup, then we would move forward to scheduling our Spark job on Airflow. Very importantly, the Spark connection on the Airflow UI would look like the following.

Spark connection configuration in Airflow UI
Airflow UI showing completed Spark Job runs.

Final Note

In general, the problems we encountered are dependency issues of various nature. One of these is that Airflow and Spark Python drivers must be of the same version for Airflow and Spark to work properly. I ran into the driver issue during development, the solution is that I upgraded the Python version running in Airflow to that of Spark. Another part of this is that Java must be running in the Airflow Docker container, in our case, we have installed Java 11. But for you to send data to the S3 bucket directly from Airflow, you must have the Java version that meets the requirement of Py4J Python library running on your Airflow instance. 

Another problem you might run into is the Spark dependencies required to send data to S3 bucket or to work with AWS resources. This is very important, make sure that all the dependencies are included in the Jar folder of the $Spark_HOME. Interestingly, one thing you should keep in mind is that the recommended way of sending or retrieving data to/from S3 bucket is using the S3 access point. Make sure you create the resources rightly, you can check this blog that shows you how to create the S3 access point the right way. With all these, you can now set up your own Spark and Airflow deployment on Docker container and let Airflow schedule and run your Spark jobs. You can access all the codes used in this blog post on GitHub.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!


Join Anant's Newsletter

Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!