In Data Engineer’s Lunch #41: Pygrametl, we discussed PygramETL, a Python ETL tool. This concludes, for now, our series on Python ETL tools. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
PygramETL is a Python ETL tool. It was pitched as an alternative to the graphical BI programs that existed at the time: meant to be as easy to use as those tools while performing much better in terms of processing speed. It can run on either CPython or Jython, drawing its speed from the underlying C or Java runtime respectively. PygramETL is unique among the tools discussed in this series because it is specialized for a specific type of data sink. While the sink can be any relational database, it must be organized in a particular way: PygramETL uses dimensional modeling to load transformed data into a compatible data warehouse. The sink needs to be a relational database because dimensional modeling relies on foreign keys, which connect fact tables to dimensions, as well as dimensions to each other.
During this series, we have drawn a line between two distinct types of ETL tools. The first kind facilitates the scheduling and handing off of data to external processes; it helps orchestrate ETL but tends not to be the best tool for processing data directly. Airflow, Luigi, and (partially) Bonobo all fall into this category. PygramETL, however, falls into the other category alongside Pandas, Petl, and PySpark: tools that load data into an internal representation and manipulate it with their own methods. PygramETL holds data in data sources and transforms it with transforming and filtering sources and aggregators.
PygramETL uses data sources to load data in from external systems. The main sources are SQLSource and CSVSource. SQLSource runs a provided query against an external SQL database and returns the results. Connections to external databases are handled differently depending on whether PygramETL runs on Jython or CPython: Jython connections work via JDBC, while CPython connections use any PEP 249 (DB-API 2.0) driver. Either kind is then wrapped in a ConnectionWrapper to give PygramETL a uniform interface.
All PygramETL sources expose an iterator over the rows they contain, and each row is represented as a Python dictionary with column names as keys and field values as values. CSVSource loads data from CSV files with a provided separator, and PandasSource allows PygramETL to work with data stored in Pandas DataFrames.
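The row shape these sources yield can be seen with nothing but the standard library, since a CSV-backed source behaves much like Python's own `csv.DictReader` (the data and delimiter here are made up for illustration):

```python
import csv
import io

# An in-memory CSV standing in for a real file; a CSVSource similarly takes a
# file-like object and a delimiter.
data = io.StringIO("id;name;price\n1;widget;9.99\n2;gadget;14.50\n")

# Each row comes back as a dict: column names as keys, field values as values.
rows = list(csv.DictReader(data, delimiter=";"))
print(rows[0])  # {'id': '1', 'name': 'widget', 'price': '9.99'}
```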
PygramETL offers transforming data sources that apply transformations to existing data sources, regardless of type. MergeJoiningSource, HashJoiningSource, UnionSource, and RoundRobinSource all combine two or more defined sources: MergeJoiningSource joins sources that are already sorted and maintains that order; HashJoiningSource joins on hash keys, so its input need not be sorted but its keys must be hashable; UnionSource combines sources without loading duplicate rows; and RoundRobinSource interleaves fixed-size sections from each source. FilteringSource enforces a condition on rows from an underlying source, ProcessSource separates data into batches for external processing, and MappingSource and TransformingSource apply functions to rows from another source.
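The pattern behind these wrappers can be sketched with plain generators. This is an illustration of the idea, not pygrametl's actual classes: each "source" is an iterable of row dicts, and a transforming or filtering source wraps another source.

```python
def transforming_source(source, *transformations):
    # Apply each transformation to every row, in the spirit of
    # TransformingSource (transformations mutate the row dict in place).
    for row in source:
        for transformation in transformations:
            transformation(row)
        yield row

def filtering_source(source, predicate):
    # Pass through only rows satisfying the predicate, like FilteringSource.
    return (row for row in source if predicate(row))

base = [{"name": "widget", "price": "9.99"},
        {"name": "gadget", "price": "14.50"}]

def parse_price(row):
    row["price"] = float(row["price"])

# Chain the wrappers: parse prices, then keep only the cheap rows.
flow = filtering_source(transforming_source(base, parse_price),
                        lambda row: row["price"] < 10)
result = list(flow)
print(result)  # [{'name': 'widget', 'price': 9.99}]
```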
These transforming sources can wrap each other as well, so a whole data flow can be built from sources alone: load data into a base source, then modify it with a chain of other sources. Using wrapping sources, rather than transforming data by looping over a source's iterator, is necessary for the intermediate steps of a flow because wrapping does not exhaust the iterator. Applying a function to each row with TransformingSource yields a new source with the function applied and all rows still available to iterate over, whereas applying the same transformation in a for loop leaves the original source's iterator empty.
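The exhaustion problem is easy to demonstrate with any Python iterator: a manual loop drains it, while a wrapping generator defers all reading until the final consumer iterates.

```python
rows = ({"n": n} for n in range(3))   # a generator, like a source's row iterator

consumed = [row for row in rows]      # a manual for loop drains the iterator...
leftover = list(rows)                 # ...leaving nothing for a downstream stage
print(consumed)   # [{'n': 0}, {'n': 1}, {'n': 2}]
print(leftover)   # []

# Wrapping instead of draining keeps the flow lazy: no row is read until the
# final consumer iterates.
fresh = ({"n": n} for n in range(3))
doubled = ({"n": row["n"] * 2} for row in fresh)   # nothing consumed yet
result = list(doubled)
print(result)     # [{'n': 0}, {'n': 2}, {'n': 4}]
```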
PygramETL also provides aggregators, which collect data over all of the rows of a source and return a value computed from them. The ones included by default are Avg, Count, CountDistinct, Max, Min, and Sum. Users can also define their own by subclassing pygrametl.aggregators.Aggregator. A user-defined aggregator needs a process method, applied to each row individually, and a finish method that combines the collected data into a result. In a count aggregator, for example, process would contribute a one for every row, and finish would add all those ones together to produce the count.
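A minimal sketch of that count example in plain Python (the real base class lives in pygrametl.aggregators; the class name and storage details here are illustrative):

```python
class CountAggregator:
    """Counts rows by contributing a 1 per row, then summing in finish()."""

    def __init__(self):
        self._contributions = []

    def process(self, row):
        # Called once per row; a count contributes 1 regardless of content.
        self._contributions.append(1)

    def finish(self):
        # Combine the per-row contributions into the final value.
        return sum(self._contributions)

agg = CountAggregator()
for row in [{"x": 1}, {"x": 2}, {"x": 3}]:
    agg.process(row)
print(agg.finish())  # 3
```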
As mentioned above, PygramETL expects to load data into a data warehouse whose structure complies with dimensional modeling. Because of this, PygramETL has structures for defining fact tables and dimensions, including slowly changing dimensions (in which a few columns are expected to change over time) and snowflaked dimensions (where the actual data is normalized out into separate tables and higher-level dimensions hold only a foreign key).
Fact tables are meant to hold only the most business-essential information, alongside foreign keys into the other tables where the rest of the data is stored. PygramETL does not handle the creation or maintenance of the schema for these tables in the SQL database, but when rows are inserted into a created fact table, PygramETL handles distributing that data to the relevant dimension tables.
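The bookkeeping this saves can be seen by doing it by hand against a tiny star schema in SQLite. This is a hand-rolled sketch of the lookup-or-insert pattern that pygrametl's Dimension and FactTable objects automate; the table and column names are invented for illustration, and, as in pygrametl, creating the schema itself is still our job.

```python
import sqlite3

# A tiny star schema: one dimension table plus a fact table holding a
# foreign key into it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim(product_id),
        amount REAL
    );
""")

def ensure_product(name):
    # Lookup-or-insert: return the surrogate key, inserting the dimension
    # row only if it is missing.
    found = conn.execute("SELECT product_id FROM product_dim WHERE name = ?",
                         (name,)).fetchone()
    if found:
        return found[0]
    cur = conn.execute("INSERT INTO product_dim (name) VALUES (?)", (name,))
    return cur.lastrowid

# Loading three fact rows creates only two dimension rows.
for name, amount in [("widget", 9.99), ("widget", 9.99), ("gadget", 14.50)]:
    conn.execute("INSERT INTO sales_fact VALUES (?, ?)",
                 (ensure_product(name), amount))

print(conn.execute("SELECT COUNT(*) FROM product_dim").fetchone()[0])  # 2
print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])   # 3
```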
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!