Data Engineer’s Lunch #59: Spark Tasks and Distribution

In Data Engineer’s Lunch #59: Spark Tasks and Distribution, we used a machine learning example to investigate how Spark distributes work between nodes. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

Spark Architecture

An Apache Spark cluster consists of a number of nodes running JVM processes. These nodes are either driver nodes or worker nodes that, together, coordinate and execute the processing of data. The driver splits the work into tasks and distributes them, along with the data they need, to the workers, which execute them.

A task is the smallest unit of work that gets distributed to worker nodes. A task applies an operation like map or filter (or a chain of such operations) to a single partition of data. A set of tasks that can all run in parallel without redistributing data is called a stage. When data does need to be redistributed between nodes, a new stage begins, and the resulting sequence of stages makes up a job. Transformations like map and filter are lazy and do no work on their own; it is actions like count and foreach, which need a result from every partition, that trigger jobs.
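To make the task and stage vocabulary concrete, here is a minimal sketch in plain Python. It is not the Spark API: the partitions list, run_task, and run_stage are hypothetical names chosen for illustration, and a thread pool stands in for the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dataset that has already been split into three partitions.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]

def run_task(partition):
    """One task: apply the stage's operations to a single partition."""
    doubled = (x * 2 for x in partition)        # map step
    kept = [x for x in doubled if x % 3 == 0]   # filter step
    return len(kept)                            # partial count for this partition

def run_stage(parts, workers=3):
    """One stage: one task per partition, with no data moved between partitions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, parts))

# The "action": the driver combines the per-partition results into one answer.
partial_counts = run_stage(partitions)
total = sum(partial_counts)
print(partial_counts, total)
```

Each call to run_task only ever sees its own partition, which is why the tasks can run in parallel. Only the final sum needs a result from every partition, which is the role an action like count plays.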

Shuffle Operations

Shuffle operations involve redistribution of data across the cluster. Since data must be copied between executors and between machines, a shuffle can be costly in terms of both network traffic and time. Shuffles can be caused by repartition (which redistributes data according to user specifications) and coalesce (which merges multiple partitions into fewer, and can avoid a full shuffle when it only reduces the partition count), as well as by join operations and ByKey operations like groupByKey and reduceByKey. A shuffle involves disk I/O, data serialization, and network I/O. It is structured like a MapReduce program: a set of map tasks organizes the data, and a set of reduce tasks then aggregates it from across the cluster.
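The map and reduce halves of a shuffle can be sketched in plain Python as well. This is an illustration only and does not reflect how Spark implements shuffles internally: partition_for, map_side, and reduce_side are hypothetical helpers, the input data is made up, and a CRC32 hash stands in for a partitioner.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of output partitions

def partition_for(key):
    """Pick the reducer for a key (a stable hash stands in for a partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def map_side(partition):
    """Map task: bucket one input partition's records by target reducer."""
    buckets = defaultdict(list)
    for key, value in partition:
        buckets[partition_for(key)].append((key, value))
    return buckets

def reduce_side(reducer_id, all_map_outputs):
    """Reduce task: fetch this reducer's bucket from every map output, sum by key."""
    totals = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets.get(reducer_id, []):
            totals[key] += value
    return dict(totals)

# Hypothetical key/value records spread over two input partitions.
input_partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]

map_outputs = [map_side(p) for p in input_partitions]
shuffled = [reduce_side(r, map_outputs) for r in range(NUM_REDUCERS)]

# Every key ends up in exactly one output partition, so merging is safe.
merged = {}
for part in shuffled:
    merged.update(part)
print(merged)
```

The costly part in a real cluster is the step that reduce_side glosses over: each reduce task has to pull its bucket from every map output, which is where the serialization, disk, and network work comes from.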

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!

