In Data Engineer’s Lunch #21: Python ETL Tools, we discussed, compared, and contrasted a number of Python tools that assist in running ETL pipelines. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
ETL tools help data engineers create, manage, and run pipelines that perform data management tasks. One of the most common of these tasks is ETL, which stands for extract, transform, and load. Extract means taking data from somewhere and putting it into a format that can be manipulated in later processing. The extract step often needs to interact with data storage technologies such as databases and file formats like CSV and JSON.
Transform applies operations to your data in order to make the changes that you want. While base Python could theoretically perform some of these operations, it is both limited in scope and slow on large datasets. ETL tools help with the transform step by providing built-in functions or classes for data manipulation, making common transformations easier. They may also use an underlying data representation based on a more performant language like Java or C in order to speed up processing. Finally, ETL tools may distribute portions of the workload, either to different threads on the same machine or out to different machines.
Load is much like the extract step, but in the opposite direction. It involves taking the processed data and pushing it back out to some longer-term storage medium. This step uses the same kinds of connections to file systems and databases as the extract step, writing the processed data out for later use.
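To make the three steps concrete, here is a minimal sketch of an ETL pipeline using only the Python standard library. The file names, column names, and tax rate are hypothetical, chosen purely for illustration:

```python
import csv
import json

def extract(csv_path):
    """Extract: read rows from a CSV file into a list of dictionaries."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: drop rows with no price and compute a derived field."""
    out = []
    for row in rows:
        if not row.get("price"):
            continue  # skip rows missing a price
        row["price"] = float(row["price"])
        row["taxed_price"] = round(row["price"] * 1.07, 2)
        out.append(row)
    return out

def load(rows, json_path):
    """Load: write the processed rows to JSON for later use."""
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
```

The tools below take over one or more of these roles, replacing hand-written loops with connectors, built-in transformations, or distributed execution.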
Pygrametl is a Python package containing a number of classes for building ETL pipelines. Its datasources module provides access to a number of external data sources, loading that data into dimension or fact tables, both of which exist as internal data representations in its tables module. It can parallelize operations, and it can also run on Jython to speed up processing by using Java data types instead of Python ones. It also has a steps module for defining individual steps within an ETL pipeline.
Pygrametl contains tools for connecting to external data sources, has an internal data representation with a number of common data transformations, can parallelize on a single machine, and includes tools for defining steps within a workflow.
Petl is a general-purpose Python package for extracting, transforming, and loading tables of data. It has integrations with SQLAlchemy, pandas, and a number of common data storage file types that help with extracting and loading data. It also contains predefined functions for common transformations and lets the user create new transformations via lambda functions.
Petl thus provides connectors for external data sources, a number of common data transformations, and a mechanism for defining new ones.
Pandas provides high-performance data structures as well as data analysis tools that can be used to perform ETL. For extraction, pandas can load files and other Python representations of data into its internal data representation, the DataFrame. Pandas also has methods such as read_sql for loading data from databases, though these rely on external libraries for the database connection. DataFrames support a large number of built-in transformations, and it is generally easy to define custom ones.
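A small sketch of the pandas pattern; the columns and values are invented, and in practice the DataFrame would usually come from read_csv or read_sql rather than a dictionary:

```python
import pandas as pd

# Extract: build a DataFrame (read_csv / read_json work the same way for files).
df = pd.DataFrame({
    "category": ["fruit", "fruit", "veg"],
    "qty": [3, 2, 5],
    "price": [1.0, 2.0, 0.5],
})

# Transform: vectorized built-ins plus a derived column.
df["total"] = df["qty"] * df["price"]
summary = df.groupby("category", as_index=False)["total"].sum()

# Load: to_csv, to_json, or to_sql would push the result back out.
```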
Bonobo is an ETL framework for managing data engineering workflows. Bonobo forms directed acyclic graphs of tasks and streams rows of data between them, one at a time. It generally targets smaller-scale data processing jobs, not big data. Bonobo has integrations with SQLAlchemy for connecting to external databases, and readers for JSON and CSV file formats. It does not have an internal data representation beyond the row, and no predefined data transformations. Bonobo’s main strength is in the definition of pipelines: because it can represent branches and merges, the set of data processes that Bonobo can express is much broader than that of tools limited to linear workflows.
PySpark is the Python library for interacting with Apache Spark, the analytics engine. It has its own type of DataFrame, similar to pandas, that enables data transformation. Interaction with other parts of Spark, like Spark SQL, gives PySpark the ability to extract and load data from a variety of sources. Spark’s defining feature is that it is a distributed system running across a number of machines, which lets PySpark speed up processing by distributing steps out to a number of workers.
Airflow is a Python workflow scheduling tool. It runs directed acyclic graphs of tasks like Bonobo but (especially with Airflow 2.0) works at a much larger scale. The user has to handle extraction, loading, movement of data between steps, and transformation of the data themselves, but Airflow’s ability to build DAGs of tasks and schedule their runs is second to none.
Luigi is a Python package for managing pipelines of batch jobs. Data is generally passed between tasks in files, but Luigi also has functionality for reading from or writing to databases for the input and output steps of pipelines.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!