In Data Engineer’s Lunch #30: Databand, we discuss data pipeline observability. The live recording of the session, which includes a more in-depth discussion, is embedded below in case you were not able to attend. Data Engineer’s Lunch is hosted live every Monday at noon EST. Register here now!
Databand is a pipeline observability platform. The application contains services for storing, analyzing, visualizing, and alerting on pipeline metadata. Pipeline metadata includes various run information, like job durations and errors, and data quality metrics, like data counts and completeness. The application stack contains the following components
Databand helps data engineers guarantee reliable SLAs. You can use the solution to monitor runs, alert on failures, and do root cause analysis to find where errors are coming from. You can collect application logs and metadata from your pipelines including:
If you are using an orchestrator such as Apache Airflow, Databand can sync metadata from the Airflow database and provide you deeper insights and utilities for monitoring your DAG health, for example alerting on anomalous runtimes.
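To make "alerting on anomalous runtimes" concrete, here is a minimal sketch of one common approach: comparing the latest run duration against the mean and standard deviation of past runs. This is not Databand's implementation or API; the function name, the threshold, and the sample durations are all assumptions made for illustration.

```python
from statistics import mean, stdev

def is_anomalous_runtime(history_seconds, latest_seconds, z_threshold=3.0):
    """Return True if the latest runtime deviates strongly from past runs.

    Hypothetical helper: a plain z-score check, not a Databand function.
    """
    if len(history_seconds) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(history_seconds)
    sigma = stdev(history_seconds)
    if sigma == 0:
        # Every past run took exactly the same time; any change stands out.
        return latest_seconds != mu
    return abs(latest_seconds - mu) / sigma > z_threshold

# Made-up durations (in seconds) of previous runs of a single DAG
past_runs = [610, 595, 602, 620, 598, 605]

print(is_anomalous_runtime(past_runs, 607))   # False: close to the usual ~605s
print(is_anomalous_runtime(past_runs, 1900))  # True: roughly three times slower
```

A z-score is the simplest possible choice here. A real monitoring system may use more robust statistics or account for trends and seasonality.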
If you execute jobs in remote or distributed systems like Spark, SQL databases, or Docker containers, it can be a challenge to gather the right information from your execution environment and understand how it aligns with your DAGs. This can lead to information silos that slow down debugging and root cause analysis (RCA), or even create inconsistencies between systems. Databand can track metadata and logs from task executors, so you can access log and error information in one place.
As more businesses rely on their data products, producing reliable, consistent, and high-quality data outputs has become critical for many teams.
You can leverage Databand to track, alert on, and investigate problems in data quality, integrity, and access. Databand provides visibility into this information by collecting usage and profiling information about your datasets, and by letting you define custom data quality metrics that are sent to Databand’s tracking system.
When you integrate Databand with your pipelines, Databand can automatically gather metadata about your data sets in use and store that information for analysis. Examples include:
All automatically extracted metadata can be granularly defined and throttled, according to data privacy requirements.
Every organization’s data is different, so you are also free to define custom metrics about your data sets through Databand’s API. Custom metrics can be used in addition to, or instead of, automatically extracted metadata. Examples of user-generated metrics include specialized outlier checks or data completeness scores.
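As an illustration of the two examples just mentioned, the sketch below computes a data completeness score and a simple outlier count in plain Python. It only shows how such metrics might be calculated. The function names, metric names, and sample rows are hypothetical, and the step of actually sending the values to Databand through its API is deliberately left out.

```python
from statistics import median

def completeness_score(rows, required_fields):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return complete / len(rows)

def count_outliers(values, k=3.0):
    """Count values farther than k median absolute deviations from the median."""
    if not values:
        return 0
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: anything different counts as an outlier.
        return sum(1 for v in values if v != med)
    return sum(1 for v in values if abs(v - med) / mad > k)

# Made-up sample of an "orders" data set
rows = [
    {"order_id": 1, "amount": 20.0, "country": "US"},
    {"order_id": 2, "amount": 22.5, "country": ""},
    {"order_id": 3, "amount": None, "country": "DE"},
    {"order_id": 4, "amount": 21.0, "country": "FR"},
    {"order_id": 5, "amount": 950.0, "country": "US"},
]
amounts = [r["amount"] for r in rows if r["amount"] is not None]

# Hypothetical metric names; these values are what you would report as custom metrics.
custom_metrics = {
    "orders.completeness": completeness_score(rows, ["amount", "country"]),
    "orders.amount_outliers": count_outliers(amounts),
}
print(custom_metrics)  # {'orders.completeness': 0.6, 'orders.amount_outliers': 1}
```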
It’s easy to lose track of who is working with which data files. This is a problem for the obvious governance reasons, but also for data quality, when you need to know how issues in data cascade across an organization. As you integrate Databand, the product can track which files or tables are being processed and attribute them to specific pipelines and pipeline owners.
You can set alerts on all metadata that Databand tracks, from run durations to data quality metrics. You can configure alerts programmatically or through the Databand UI.
Run alerts cover metadata from the overall pipeline/DAG execution, including overall duration and state (running/succeeded/failed).
Data health alerts cover metadata coming from tasks, operators, or functions within a pipeline. Examples include data profiling metrics, like the standard deviation or mean of a column, and custom user-defined metrics.
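The difference between run alerts and data health alerts can be sketched as a small set of rules evaluated against the metadata of one run. The rule structure, metric names, and thresholds below are hypothetical and are not Databand's alert format. They only illustrate that both kinds of alert boil down to a condition on a tracked value.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq, "!=": operator.ne}

@dataclass
class AlertRule:
    name: str          # human-readable alert name
    metric: str        # key of the tracked value the rule watches
    op: str            # one of the keys in OPS
    threshold: object  # value to compare against

def evaluate(rules, metrics):
    """Return the names of all rules whose condition holds for the given metrics."""
    fired = []
    for rule in rules:
        if rule.metric not in metrics:
            continue  # metric was not reported for this run
        if OPS[rule.op](metrics[rule.metric], rule.threshold):
            fired.append(rule.name)
    return fired

rules = [
    # Run alerts: state and duration of the whole pipeline/DAG execution
    AlertRule("run failed", "run.state", "==", "failed"),
    AlertRule("run too slow", "run.duration_seconds", ">", 900),
    # Data health alerts: profiling metrics reported from within a task
    AlertRule("price mean drifted", "prices.mean", ">", 120.0),
    AlertRule("price spread too wide", "prices.stddev", ">", 15.0),
]

# Made-up metadata from a single run
run_metrics = {
    "run.state": "succeeded",
    "run.duration_seconds": 1240,
    "prices.mean": 101.3,
    "prices.stddev": 18.2,
}

print(evaluate(rules, run_metrics))  # ['run too slow', 'price spread too wide']
```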
After a metric is tracked by Databand, you can compare the metric across all run histories from the relevant pipeline, along with other values from that run. Databand will keep all metrics in the context of the pipeline and run where the metric was produced, making it easier for users to correlate metadata and trace the root cause of issues. For example, you can quickly identify how pipeline performance or application errors from a specific run relate to changes in data volumes or resource consumption levels.
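As a rough illustration of this kind of correlation, the sketch below keeps several metrics per run and checks how strongly run duration follows input data volume across the run history. The run records and field names are made up, and the Pearson correlation is just one simple way to relate two metrics; it does not describe how Databand performs the analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)

# Made-up run history of one pipeline: each metric stays attached to its run
runs = [
    {"run_id": "r1", "duration_s": 300, "rows_in": 1_000_000},
    {"run_id": "r2", "duration_s": 310, "rows_in": 1_050_000},
    {"run_id": "r3", "duration_s": 305, "rows_in": 1_000_000},
    {"run_id": "r4", "duration_s": 620, "rows_in": 2_100_000},
    {"run_id": "r5", "duration_s": 640, "rows_in": 2_200_000},
]

durations = [r["duration_s"] for r in runs]
volumes = [r["rows_in"] for r in runs]

r = pearson(durations, volumes)
if r > 0.9:
    print("Slow runs line up with larger inputs; the data volume is the likely cause.")
else:
    print("Duration does not track input size; look at errors or resources instead.")
```

Because every value stays attached to the run that produced it, the same comparison can be repeated for any other pair of metrics, such as error counts against resource consumption.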