In Data Engineer’s Lunch #40: Streaming vs. Batch for ETL, we discuss when to use real-time stream processing and when to process data in batches. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Managing large amounts of data seems to be moving toward real-time streaming, but there are still cases in which performing ETL in batches is the right choice. This article outlines streaming vs. batch for ETL. We will discuss what each process is, cover some of the technologies used, and highlight some of the strengths and weaknesses of each.
In batch processing, data is collected and stored over windows of time, and each window is then processed as a unit. The schedule of these windows varies, but processing usually occurs once or twice a day. Batch tasks are executed in whatever order the workflow designates. Batch is best suited to very large data sets that must be processed as a whole, for example when sorting or calculating totals and averages.
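As a rough illustration of why these operations fit batch processing, here is a minimal Python sketch that handles one complete window of records at once. The `run_batch` function, the `(account, amount)` record shape, and the sample data are all hypothetical, chosen only for this example. The point is that totals, averages, and the final ranking can only be produced after the whole window has been collected.

```python
from collections import defaultdict
from statistics import mean


def run_batch(records):
    """Process one complete window of (account, amount) records.

    Hypothetical helper for illustration: it assumes the whole window
    fits in memory and has already been collected.
    """
    # Group every amount in the window by account.
    by_account = defaultdict(list)
    for account, amount in records:
        by_account[account].append(amount)

    # Totals and averages need every record for an account.
    summary = {
        account: {"total": sum(amounts), "average": mean(amounts)}
        for account, amounts in by_account.items()
    }

    # Sorting by total needs the full set of accounts.
    ranked = sorted(summary, key=lambda a: summary[a]["total"], reverse=True)
    return summary, ranked


# Made-up sample window standing in for a day's worth of collected data.
window = [
    ("alice", 20.0),
    ("bob", 5.0),
    ("alice", 40.0),
    ("bob", 15.0),
    ("carol", 100.0),
]
summary, ranked = run_batch(window)
```

In a real pipeline a scheduler would trigger a job like this once per window, and the work would typically run on a distributed engine rather than in a single Python process.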
Streaming executes ETL tasks on data as it flows through the pipeline. Each piece of data is processed as soon as it is ingested. Streaming is ideal for analyzing data in real time. Sources that continuously produce data, and workloads that require immediate detection of anomalies such as fraud monitoring, are prime use cases for streaming.
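To make the contrast with batch concrete, here is a minimal Python sketch of per-record processing. The `detect_anomalies` generator, its `threshold` and `min_seen` parameters, and the "amount far above the running mean" rule are assumptions made for this example, not a real fraud model. What it shows is that each event gets a decision the moment it arrives, using only a small amount of running state.

```python
def detect_anomalies(events, threshold=3.0, min_seen=3):
    """Yield (amount, is_anomaly) for each event as it arrives.

    Hypothetical rule for illustration: after at least `min_seen`
    events, an amount is flagged when it exceeds `threshold` times
    the running mean of all earlier amounts.
    """
    count = 0
    running_mean = 0.0
    for amount in events:
        # Decide immediately, using only state built from earlier events.
        is_anomaly = count >= min_seen and amount > threshold * running_mean
        yield amount, is_anomaly

        # Update the running mean incrementally; no history is stored.
        # Flagged amounts are included in the mean to keep the sketch simple.
        count += 1
        running_mean += (amount - running_mean) / count


# Made-up sample events; an iterator stands in for a live source.
stream = iter([10.0, 12.0, 11.0, 95.0, 10.0])
flags = [flag for _, flag in detect_anomalies(stream)]
```

Because the generator never needs the full data set, the same loop could in principle be fed from an unbounded source such as a message queue consumer.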
When weighing streaming vs. batch for ETL, the technologies involved are a major consideration. Cassandra, Spark, and Kafka are some of the principal technologies we use here at Anant. In fact, we have multiple blog posts on getting started with and using each of them. One such blog post outlines using Spark, Cassandra, and Elasticsearch for Data Processing.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!