Data Engineer’s Lunch #54: dbt and Spark

In Data Engineer’s Lunch #54: dbt and Spark, we discussed the data build tool, a tool for managing data transformations with config files rather than code. We connected it to Apache Spark and used it to perform transformations. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

DBT Overview

dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Essentially, dbt performs in-place data transformations within the data warehouse you are already using to store your data. It uses projects composed of .yml and .sql files to build a dag of simple transformations that combine to create a full transformation pipeline. Dbt Labs, some software vendors, and some user communities maintain adapters for connecting dbt to some of the most used data warehouse technologies and BI tools. These technologies include databases like Postgres, analytics engines like Apache Spark, and BI tools like BigQuery. 

DBT Internals

Dbt connects to your data warehouse through data stored in a profiles.yml file. Each adapter has different requirements for its profile. The profiles.yml files store multiple profiles concurrently but only use one at a time.

Projects contain the yml files and sql files that define transformations and dependencies. They must contain a project file and model files. A dbt_project.yml file tells dbt that a particular directory is a dbt project, and also contains configurations for your project. A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics or, more often, is an intermediate step in such a transformation. Project directories and template files are created via dbt init commands.

A model is a select statement. Models are defined in .sql files (typically in your model’s directory). Each .sql file contains just one model / select statement. The name of the file is used as the model name. Models can be nested in subdirectories within the model’s directory. When projects are executed models are placed in sequence in order to form pipelines that can perform complex transformations.

DBT and Spark

The dbt Spark adapter connects to persisted Spark SQL tables and performs transformations, saving the results into new intermediate and final tables. There are 3 possible connection methods depending on what type of Spark deployment you have

odbc is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster. Http is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster. Thrift connects directly to (a thrift server on) the lead node of a cluster, either locally hosted / on premise or in the cloud (e.g. Amazon EMR).

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!


Join Anant's Newsletter

Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!