Most site builders are optimized for one type of user. We shouldn't need to choose a platform based on whether we identify as developers, designers, marketers, no-coders, or low-coders.
People building websites have come to expect a pleasing and intuitive visual editing experience, as evidenced by the success of Squarespace, Wix, Webflow, and others. At the same time, the success of patterns and tools like Jamstack, React, Next.js, and Tailwind have shown us that people developing websites demand a smooth developer experience.
Instead of working on another site builder for a narrow segment of users, we realized the opportunity to create an unparalleled experience for developers and non-developers building great websites together.
The Cassandra data model can be difficult to understand initially as some terms, similar to those used in the relational world, can have a different meaning here, while others are completely new.
A keyspace is the container for tables in a Cassandra data model. A table is the container for an ordered collection of rows. Rows are made of a primary key plus an ordered set of columns, themselves made of name/value pairs.
There is no need to store a value for every column each time a new row is stored. Cassandra's data model can hold wide rows with lots of columns (up to millions of columns). It can also hold many rows with a smaller set of columns.
The primary key is a composite made of a partition key plus an optional set of clustering columns. The partition key is used to determine the nodes on which rows are stored, and it can consist of multiple columns. The clustering columns control how data is sorted within a partition. Cassandra also supports static columns, storing data that is not part of the primary key, but shared by every row in a partition.
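As a hedged sketch (the table and column names below are made up for illustration, not taken from the article), a CQL table combining a composite partition key, a clustering column, and a static column could look like this:

-- Partition key: (sensor_id, day); clustering column: reading_ts; static column: sensor_name
CREATE TABLE readings_by_sensor (
    sensor_id   uuid,
    day         date,
    reading_ts  timestamp,
    value       double,
    sensor_name text STATIC,   -- shared by every row in the partition
    PRIMARY KEY ((sensor_id, day), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);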
When a column is created, a data type is defined to constrain the values stored in that column. Data types include character and numeric types, collections, and user-defined types. Three types of collections can be defined: sets, lists, and maps. A column also has other attributes: timestamps and time-to-live. A timestamp is generated for a column value, each time it is created or updated, to resolve any conflicting change to the value. The time-to-live (TTL) is used to indicate how long to keep the value.
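For illustration only (a hypothetical table, not from the article), collections, TTL, and write timestamps can be exercised like this:

CREATE TABLE user_profiles (
    user_id      uuid PRIMARY KEY,
    display_name text,
    emails       set<text>,
    phones       list<text>,
    settings     map<text, text>
);

-- The inserted values expire after one day (86400 seconds)
INSERT INTO user_profiles (user_id, display_name, emails)
VALUES (uuid(), 'alice', {'alice@example.com'}) USING TTL 86400;

-- Inspect the write timestamp and remaining TTL of a regular column
SELECT writetime(display_name), ttl(display_name) FROM user_profiles;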
A secondary index is an index on any column that is not part of the primary key. Since Cassandra partitions data across multiple nodes, each node must maintain its own copy of a secondary index for the rows it stores. Therefore, secondary indexes are not recommended on columns with very high or very low cardinality, or on columns that are frequently updated or deleted.
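A minimal sketch (hypothetical table and column names):

CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    country text,
    email   text
);

-- Index on a moderate-cardinality, rarely updated column
CREATE INDEX users_by_country ON users (country);

SELECT * FROM users WHERE country = 'BE';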
Joins cannot be performed at the database level. If there is need for a join, either it must be performed at the application level, or preferably, the Cassandra data model should be adapted to create a denormalized table that represents the join results.
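For example (hypothetical schema), instead of joining a users table and a videos table at read time, the join result can be stored up front in a query-driven table:

CREATE TABLE videos_by_user (
    user_id    uuid,
    added_date timestamp,
    video_id   timeuuid,
    title      text,
    user_name  text,   -- duplicated (denormalized) from the users table
    PRIMARY KEY ((user_id), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id ASC);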
Chebotko Diagrams
Dr. Artem Chebotko, a Solution Architect at DataStax, introduced a methodology and notation to capture and visualize Cassandra data models, covering table schemas and the queries they support. The methodology is based on the successive steps of traditional relational normalized data modeling: conceptual, logical, physical. It results in Chebotko Diagrams for better visualization.
Hackolade recognizes several novelties brought to light by the Chebotko diagrams and notation, but takes some liberties with them in order to further simplify and streamline the process. As explained in this article, we advocate that conceptual modeling should be replaced by Domain Driven Design, whose approach and terminology are a much better fit with NoSQL concepts. In particular, DDD aggregates map directly to denormalization and data nesting, allowing the elimination of data joins. This direct mapping between DDD bounded context aggregates to nested objects has another sizable advantage: it allows the elimination of the logical data modeling step.
Cassandra Data Modeling Tool
Hackolade has pioneered the field of data modeling for NoSQL databases, introducing graphical software to perform the schema design of hierarchical and graph structures.
Hackolade is a Cassandra schema design software that dynamically forward-engineers CQL scripts as the user visually builds a Cassandra data model. It can also reverse-engineer an existing Cassandra or DataStax instance to derive the schema so a data modeler or information architect can enrich the model with descriptions, metadata, and constraints.
After retrieving the Cassandra CREATE TABLE CQL scripts, Hackolade persists the state of the instance data model and generates HTML documentation of the database schema to serve as a platform for a productive dialog between analysts, designers, architects, developers, and DBAs. The Cassandra schema design tool supports several use cases to help enterprises manage their databases.
Components of a Cassandra Data Model
As a data modeler starts from the application access patterns and query model, physical tables can be created in the tool, including their columns with their properties and constraints.
Entity Relationship Diagram (ERD)
Hackolade lets users visualize a Cassandra schema via an Entity-Relationship Diagram of the physical data model.
Hierarchical view of nested objects
As denormalization is applied, implicit relationships can be documented, user-defined types created and referenced in different places. It is also possible to design JSON hierarchical structures with nested objects. This can be supplemented with detailed descriptions and a log of team comments gathered as the Cassandra data model adapts over time for the schema evolution.
Outputs of a Schema Design Tool for Cassandra
In addition to the dynamic CQL script creation to facilitate development, Hackolade provides a rich, human-readable HTML report, including diagrams, collections, relationships and all their metadata. Many additional features have been developed to help data modelers.
Benefits of Data Modeling
A data model provides a blueprint for applications that closely represents the understanding of complex data-centric enterprises. Hackolade increases data agility by making data structures transparent and facilitating their evolution. The benefits of data modeling for Cassandra are widespread and measurable.
NoSQL schema design is a best practice to ensure that applications evolve, scale, and perform well. A good data model helps reduce development time, increase application quality, and lower execution risks across the enterprise.
Free trial
To experience the first Cassandra data modeling tool and try Hackolade free for 14 days, download the latest version of Hackolade and install it on your desktop. There's no risk, no obligation, and no credit card required! The software runs on Windows, Mac, and Linux, plus it supports several other leading NoSQL databases.
ScyllaDB is an open-source distributed NoSQL column-oriented data store, designed to be compatible with Apache Cassandra. It supports the same CQL query language but is written in C++ instead of Java to increase raw performance and leverage modern multi-core servers with self-tuning.
ScyllaDB is also a hybrid between a key-value and a column-oriented database. Rows are organized into tables. The first component of a primary key is a partition key, and rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.
To perform data modeling for ScyllaDB with Hackolade, you must first download the ScyllaDB plugin.
Hackolade was specially adapted to support ScyllaDB, following the principles of Cassandra data modeling, including User-Defined Types and the concepts of Partitioning and Clustering keys. It lets users define, document, and display Chebotko physical diagrams. The application closely follows the ScyllaDB terminology, data types, and Chebotko notation.
The data model in the picture below results from the reverse-engineering of a sample application imported in ScyllaDB.
Keyspace
A keyspace is a ScyllaDB namespace that defines data replication on nodes. A cluster can contain multiple keyspaces. A keyspace is a logical grouping of tables, analogous to a database in relational database systems.
Table
Tables in ScyllaDB contain rows of columns, and a primary key identifies the location and order of stored data. Tables can also be used to store JSON. Tables are declared up front at schema definition time.
Primary, Partition, and Clustering Keys
In ScyllaDB, primary keys can be simple or compound, with one or more partition keys, and optionally one or more clustering keys. The partition key determines which node stores the data. It is responsible for data distribution across the nodes. The additional columns determine per-partition clustering. Clustering is a storage engine process that sorts data within the partition.
Attribute data types
ScyllaDB supports a variety of scalar and complex data types, including lists, maps, and sets.
Hackolade was specially adapted to support the data types and attributes behavior of ScyllaDB.
Some scalar types can be configured for different modes.
ScyllaDB supports materialized views to handle automated server-side denormalization. In theory, this removes the need for client-side handling and would ensure consistency between base and view data. Materialized views work particularly well with immutable insert-only data, but should not be used in case of low-cardinality data. Materialized views are designed to alleviate the pain for developers, but are essentially a trade-off of performance for connectedness. See more info in this article.
Hackolade supports ScyllaDB materialized views, via a SELECT of columns of the underlying base table, to present the data of the base table with a different primary key for different access patterns.
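As a sketch (the tweets schema below is hypothetical, not from the article), a materialized view re-keys the base table's data for a different access pattern:

-- Base table keyed for lookups by tweet id
CREATE TABLE tweets (
    tweet_id timeuuid PRIMARY KEY,
    user_id  text,
    body     text
);

-- The same data, presented with a different primary key for per-user queries
CREATE MATERIALIZED VIEW tweets_by_user AS
    SELECT * FROM tweets
    WHERE user_id IS NOT NULL AND tweet_id IS NOT NULL
    PRIMARY KEY (user_id, tweet_id);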
Forward-Engineering
Hackolade dynamically generates the CQL script to create keyspaces, tables, columns and their data types, and indexes for the structure created with the application.
As many people store JSON within text or blob columns, Hackolade allows for the schema design of those documents. That JSON structure is not forward-engineered in the CQL script, but is useful for developers, analysts, and designers.
Reverse-Engineering
The connection is established using a connection string including (IP) address and port (typically 9042), and authentication using username/password if applicable.
The Hackolade process for reverse-engineering of ScyllaDB databases includes the execution of cqlsh DESCRIBE statements to discover keyspaces, tables, columns and their types, and indexes. If JSON is detected in string columns, Hackolade performs statistical sampling of records followed by probabilistic inference of the JSON document schema.
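For reference, the kind of cqlsh statements involved look like the following (a sketch, not necessarily the exact commands Hackolade issues; the keyspace and table names are placeholders):

DESCRIBE KEYSPACES;
DESCRIBE TABLE my_keyspace.my_table;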
For more information on ScyllaDB in general, please consult the website.
Lightning fast throughput and ultra-low latency
ScyllaDB is an open-source distributed NoSQL column-oriented data store, designed to be compatible with Apache Cassandra. It supports the same CQL query language but is written in C++ instead of Java to increase raw performance and leverage modern multi-core servers with self-tuning.
Hackolade was specially adapted to support the data modeling of ScyllaDB, including User-Defined Types and the concepts of Partitioning and Clustering keys. It lets users define, document, and display Chebotko physical diagrams. The application closely follows the ScyllaDB terminology, data types, and Chebotko notation.
The reverse-engineering function includes the table definitions, indexes, user-defined types and functions, but also the inference of the schema for JSON structures if detected in text or blob.
Thanks to the changes proposed at CASSANDRA-8717, CASSANDRA-7575 and CASSANDRA-6480, Stratio is glad to present its Lucene-based implementation of Cassandra secondary indexes as a plugin that can be attached to the Apache distribution. Before the above changes, the Lucene index was distributed inside a fork of Apache Cassandra, with all the difficulties that implied, i.e. maintaining a fork. As of now, the fork is discontinued and new users should use the recently created plugin, which maintains all the features of Stratio Cassandra.
Stratio’s Lucene index extends Cassandra’s functionality to provide near real-time distributed search engine capabilities such as those of Elasticsearch or Solr, including full text search, free multivariable search, relevance queries, and field-based sorting. Each node indexes its own data, so high availability and scalability are guaranteed.
With the new distribution as a plugin you can easily patch an existing installation of Cassandra with the following command:
Once the index has been created, we can start issuing queries. For instance, retrieve 10 tweets containing the word Cassandra in the message body:
SELECT * FROM tweets WHERE lucene = '{
filter : {type : "match",
field : "message",
value : "cassandra"}
}' LIMIT 10;
We can also find the 5 best matching tweets containing the word Cassandra in the message body for the previous query, involving Lucene relevance features:
SELECT * FROM tweets WHERE lucene = '{
query: {type: "match",
field : "message",
value : "cassandra"}
}' LIMIT 5;
Queries, filters and boolean predicates can be combined to do queries as complex as requesting the tweets inside a certain time range starting with an ‘A’ or containing a ‘B’, or with the word ‘FAST’, sorting them by ascending time, and then by descending alphabetic user name:
SELECT * FROM tweets WHERE lucene = '{
filter :
{
type : "boolean", must :
[
{type : "range", field : "createdat", lower : "2014-04-25"},
{type : "boolean", should :
[
{type : "prefix", field : "userid", value : "a"} ,
{type : "wildcard", field : "userid", value : "*b*"} ,
{type : "match", field : "userid", value : "fast"}
]
}
]
},
sort :
{
fields: [ {field :"createdat", reverse : true},
{field : "userid", reverse : false} ]
}
}' LIMIT 10000;
iptables is the packet filtering technology that’s built into the 2.4 Linux kernel. It’s what allows one to do firewalling, NATing, and other cool stuff to packets from within Linux. Actually, that’s not quite right — iptables is just the command used to control netfilter, which is the real underlying technology. We’ll just call it iptables though, since that’s how the whole system is usually referred to.
Stateful Packet Inspection
First off, many have heard a number of different definitions of “stateful” firewalls and/or “SPI” protection, and I think it’s worth the time to take a stab at clearing up the ambiguity. “Stateful Inspection” actually gives its true definition away in its name; it’s nothing more and nothing less than attempting to ensure that traffic moving through the firewall is legitimate by determining whether or not it’s part of (or related to) an existing, accepted connection.
When you hear that this firewall or that firewall does “SPI”, it could really mean anything; it’s a big buzzword right now, so every company out there wants to add it to their sales pitch. Remember, the definition is broad, so there can be (and is) a big difference between so-called SPI protection on a $50 SOHO router as compared to what’s offered on something like Check Point FW-1. The former could do a couple TCP-flag checks and call it SPI, while the latter does a full battery of tests. Just keep that in mind; not all SPI is created equal.
Netfilter Basics
iptables is made up of some basic structures, as seen below:
TABLES
CHAINS
TARGETS
TABLES
TABLES are the major pieces of the packet processing system, and they consist of FILTER, NAT, and MANGLE. FILTER is used for the standard processing of packets, and it’s the default table if none other is specified. NAT is used to rewrite the source and/or destination of packets and/or track connections. MANGLE is used to otherwise modify packets, i.e. modifying various portions of a TCP header, etc.
CHAINS
CHAINS are then associated with each table. Chains are lists of rules within a table, and they are associated with “hook points” on the system, i.e. places where you can intercept traffic and take action. Here are the default table/chain combinations:
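The table itself is not reproduced in this copy; for the 2.4-era kernels described here, the defaults are roughly:
FILTER: INPUT, FORWARD, OUTPUT
NAT: PREROUTING, OUTPUT, POSTROUTING
MANGLE: PREROUTING, OUTPUT (later kernels add the remaining chains)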
And here’s when the different chains do their thing:
PREROUTING: Immediately after being received by an interface.
POSTROUTING: Right before leaving an interface.
INPUT: Right before being handed to a local process.
OUTPUT: Right after being created by a local process.
FORWARD: For any packets coming in one interface and leaving out another.
In other words, if you want to process packets as they leave your system, but without doing any NAT or MANGLE(ing), you’ll look to the OUTPUT chain within the FILTER table. If you want to process packets coming from the outside destined for your local machine, you’ll want to use the same FILTER table, but the INPUT chain. See the image below for a visual representation of this.
TARGETS
TARGETS determine what will happen to a packet within a chain if a match is found with one of its rules. The two most common ones are DROP and ACCEPT. So if you want to drop a packet on the floor, you write a rule that matches the particular traffic and then jump to the DROP target. Conversely, if you want to allow something, you jump to the ACCEPT target — simple enough.
How Packets Move
Packets move through netfilter by traversing chains. Each non-empty chain has a list of rules in it, which packets are checked against one after another. If a match is found, the packet is processed. If no match is found, the default action is taken. The default action for a chain is also called its policy. By default, chain policies are to jump to the ACCEPT target, but this can be set to DROP if so desired (I suggest it).
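As a small example (adapt to your own setup), default policies are set with the -P switch:

iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT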
Digging In
So, with that inadequate short intro out of the way, let’s dig into it with some diagrams and a couple of cookbook-style examples:
Figure 1. How Traffic Moves Through Netfilter
Allow Outgoing (Stateful) Web Browsing
iptables -A OUTPUT -o eth0 -p TCP --dport 80 -j ACCEPT
iptables -A INPUT -i eth0 -p TCP -m state --state ESTABLISHED,RELATED --sport 80 -j ACCEPT
In the first rule, we’re simply adding (appending) a rule to the OUTPUT chain for protocol TCP and destination port 80 to be allowed. We are also specifying that the outgoing packets will need to exit the machine over interface eth0 (-o is for “output”) in order to trigger this rule; this interface designation is important when you start dealing with machines with multiple interfaces. You can also add additional checks beyond those seen here, such as what source ports are allowed, etc., but we’ll keep it simple for the examples here. The second rule allows the web traffic to come back (an important part of browsing).
Notice the “state” stuff; that’s what makes netfilter a “stateful” firewalling technology. Packets are not able to move through this rule and get back to the client unless they were created via the rule above it, i.e. they have to be part of an established or related connection, and be coming from a source port of 80 — which is usually a web server. Again, you can add more checks here, but you get the point.
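Allow Outgoing Pings (ICMP)
The actual commands are missing from this copy of the article; based on the description below, they would look roughly like this (eth0 assumed as the interface):

iptables -A OUTPUT -o eth0 -p icmp --icmp-type echo-request -j ACCEPT
iptables -A INPUT -i eth0 -p icmp --icmp-type echo-reply -j ACCEPT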
Here, we’re appending (-A) to the output (OUTPUT) chain, using the icmp (-p) protocol, of type echo-request (--icmp-type echo-request), and jumping (-j) to the ACCEPT target (which means ACCEPT it, strangely enough). That’s for the outgoing piece. For the return packet, we append to the INPUT chain instead of OUTPUT, and allow echo-reply instead of echo-request. This, of course, means that incoming echo-requests to your box will be dropped, as will outgoing replies.
“Passing Ports” Into A NATd Network
One of the most commonly-used functions of firewall devices is “passing ports” inside to other private, hidden machines on your network running services such as web and mail. In corporate environments this is out of necessity, and at home it’s often for gaming or in a hobbyist context. Either way, here’s how you do it with Netfilter/IPTABLES:
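The NAT command itself is missing from this copy; based on the breakdown that follows, it would be roughly as shown below (1.2.3.4 and 192.168.1.10 are placeholder public and internal addresses):

iptables -t nat -A PREROUTING -p tcp -d 1.2.3.4 --dport 25 -j DNAT --to 192.168.1.10:25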
If we break this down, we see that we’re actually using the nat table here rather than not specifying one. Remember, if nothing is mentioned as far as tables are concerned, you’re using the filter table by default. So in this case we’re using the nat table and appending a rule to the PREROUTING chain. If you recall from the diagram above, the PREROUTING chain takes effect right after being received by an interface from the outside.
This is where DNAT occurs. This means that destination translation happens before routing, which is good to know. So, we then see that the rule will apply to the TCP protocol for all packets destined for port 25 on the public IP. From there, we jump to the DNAT target (Destination NAT), and “jump to” (--to) our internal IP on port 25. Notice that the syntax of the internal destination is IP:PORT.
Ah, but this is only half of the work. If you have any experience with corporate-class firewalls such as Check Point or Astaro, you know there are two parts to enabling connectivity like this — the NAT portion, and the rules portion. Below is what we need to get the traffic through the firewall:
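The rule itself is missing from this copy; based on the description, it would be roughly as follows (eth0 external and eth1 internal interfaces assumed):

iptables -A FORWARD -i eth0 -o eth1 -p tcp --dport 25 -j ACCEPT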
In other words, if you just NAT the traffic, it’s not ever going to make it through your firewall; you have to pass it through the rulebase as well. Notice that we’re accepting the packets in from the first interface, and allowing them out the second. Finally, we’re specifying that only traffic to destination port (--dport) 25 (TCP) is allowed — which matches our NAT rule.
The key here is that you need two things in order to pass traffic inside to your hidden servers — NAT and rules.
Conclusion
Ok, so that about does it for now. I have obviously only scratched the surface here, but hopefully I’ve covered the very basics in a way that can help someone. I intend to keep adding to this as time goes on, both for the sake of being thorough and to clarify things as needed. If, however, I have missed something major, or made some sort of error, do feel free to contact me and set me straight. Also, be sure to check out the manpage as well as the links below.
“I'll bet the service saved me a couple hours” PCWorld
“Ninite.com frees up your day” The Christian Science Monitor
“This post can be fairly short because Ninite works exactly as advertised.” Lifehacker
As of February 14th, 2019 Ninite has ended support for Windows XP and Windows Vista as well as the related server platforms Server 2003 and Server 2008.
You'll need to upgrade your Windows version to continue using Ninite.
This is part 2 of a Cassandra Cluster Tutorial series. Part 1 used Vagrant to setup a local Cassandra Cluster and installs Cassandra on boxes. Part 2 installs Cassandra Database SSL support and configures Cassandra nodes with SSL. Later parts of this Cassandra Cluster tutorial series will setup Ansible/ssh for DevOps/DBA tasks, use Packer to create EC2 AMIs and instances, and setup a Cassandra cluster in EC2.
The Cassandra database allows you to secure the client transport (cqlsh) as well as the cluster transport (storage transport).
Remember that SSL and TLS have some overhead. This is especially true in the JVM world, which is not as performant for handling SSL/TLS unless you are using Netty/OpenSSL integration. If possible, use no encryption for the cluster transport (storage transport), and deploy your Cassandra nodes in a private subnet, and limit access to this subnet to the client transport. Also, if possible, avoid using TLS/SSL on the client transport and do client operations from your app tier, which is located in a non-public subnet.
However, it is not always possible to avoid using SSL/TLS. You may work in an industry that requires the use of encrypted transports based on regulations like the U.S. Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI DSS), or the U.S. Sarbanes-Oxley Act. Or you might work for a bank or other financial institution. Or it just might be a corporate policy to encrypt such network transports — even for internal networks.
An area of concern for compliance is authorization and encryption of data at rest. Cassandra has essential security features for authentication, role-based authorization, transport encryption (JMX, client transport, cluster transport), as well as data at rest encryption (encrypting SSTables).
This article will focus just on setting up encryption for the Cassandra client transport (cqlsh) and the cluster transport. Later articles will cover various aspects of compliance and encryption.
Encrypting the Cassandra Database Transports
Data that travels over the client transport across a network could be read by someone you don’t want accessing it, using tools like Wireshark. If the data includes private information, like Social Security numbers, credentials (username and password), credit card numbers, or account numbers, then we want to make that data unreadable to any third party. This is especially important if we don’t control the network. You can also use TLS to make sure the data has not been tampered with while traveling over the network. The Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols are designed to provide these features (SSL is the old name for what became TLS, but many people still refer to TLS as SSL).
Cassandra is written in Java. Java defines the JSSE framework which in turn uses the Java Cryptography Architecture (JCA). JSSE uses cryptographic service providers from JCA. If any of the above is new to you, please take a few minutes to read through the TLS/SSL Java guide. It does a good job explaining keystores vs. trust stores.
The client transport encryption protects data as it moves from clients to server nodes in the cluster.
The client_encryption_options are stored in the cassandra.yaml. Here is an example config.
Cassandra YAML: Sample Config
# enable or disable client/server encryption.
client_encryption_options:
enabled: false
# If enabled and optional is set to true encrypted and unencrypted connections are handled.
optional: false
keystore: conf/.keystore
keystore_password: cassandra
require_client_auth: false
# Set truststore and truststore_password if require_client_auth is true
# truststore: conf/.truststore
# truststore_password: cassandra
protocol: TLS
algorithm: SunX509
store_type: JKS
cipher_suites: [TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDH_anon_WITH_AES_256_CBC_SHA]
Cassandra Cluster: Setup SSL Keys
Before we go into the details of setting up the cassandra.yaml file, let’s create some trust stores, key stores, and export some keys with keytool (utility that ships with JDK). The following script generates cluster and client keys.
setupkeys-cassandra-security.sh - creates encryption keys for Cassandra
#!/bin/bash
KEY_STORE_PATH="$PWD/resources/opt/cassandra/conf/certs"
mkdir -p "$KEY_STORE_PATH"
KEY_STORE="$KEY_STORE_PATH/cassandra.keystore"
PKS_KEY_STORE="$KEY_STORE_PATH/cassandra.pks12.keystore"
TRUST_STORE="$KEY_STORE_PATH/cassandra.truststore"
PASSWORD=cassandra
CLUSTER_NAME=test
CLUSTER_PUBLIC_CERT="$KEY_STORE_PATH/CLUSTER_${CLUSTER_NAME}_PUBLIC.cer"
CLIENT_PUBLIC_CERT="$KEY_STORE_PATH/CLIENT_${CLUSTER_NAME}_PUBLIC.cer"
### Cluster key setup.
# Create the cluster key for cluster communication.
keytool -genkey -keyalg RSA -alias "${CLUSTER_NAME}_CLUSTER" -keystore "$KEY_STORE" -storepass "$PASSWORD" -keypass "$PASSWORD" \
-dname "CN=CloudDurable Image $CLUSTER_NAME cluster, OU=Cloudurable, O=Cloudurable, L=San Francisco, ST=CA, C=USA, DC=cloudurable, DC=com" \
-validity 36500
# Create the public key for the cluster which is used to identify nodes.
keytool -export -alias "${CLUSTER_NAME}_CLUSTER" -file "$CLUSTER_PUBLIC_CERT" -keystore "$KEY_STORE" \
-storepass "$PASSWORD" -keypass "$PASSWORD" -noprompt
# Import the identity of the cluster public cluster key into the trust store so that nodes can identify each other.
keytool -import -v -trustcacerts -alias "${CLUSTER_NAME}_CLUSTER" -file "$CLUSTER_PUBLIC_CERT" -keystore "$TRUST_STORE" \
-storepass "$PASSWORD" -keypass "$PASSWORD" -noprompt
### Client key setup.
# Create the client key for CQL.
keytool -genkey -keyalg RSA -alias "${CLUSTER_NAME}_CLIENT" -keystore "$KEY_STORE" -storepass "$PASSWORD" -keypass "$PASSWORD" \
-dname "CN=CloudDurable Image $CLUSTER_NAME client, OU=Cloudurable, O=Cloudurable, L=San Francisco, ST=CA, C=USA, DC=cloudurable, DC=com" \
-validity 36500
# Create the public key for the client to identify itself.
keytool -export -alias "${CLUSTER_NAME}_CLIENT" -file "$CLIENT_PUBLIC_CERT" -keystore "$KEY_STORE" \
-storepass "$PASSWORD" -keypass "$PASSWORD" -noprompt
# Import the identity of the client pub key into the trust store so nodes can identify this client.
keytool -importcert -v -trustcacerts -alias "${CLUSTER_NAME}_CLIENT" -file "$CLIENT_PUBLIC_CERT" -keystore "$TRUST_STORE" \
-storepass "$PASSWORD" -keypass "$PASSWORD" -noprompt
keytool -importkeystore -srckeystore "$KEY_STORE" -destkeystore "$PKS_KEY_STORE" -deststoretype PKCS12 \
-srcstorepass "$PASSWORD" -deststorepass "$PASSWORD"
openssl pkcs12 -in "$PKS_KEY_STORE" -nokeys -out "${CLUSTER_NAME}_CLIENT.cer.pem" -passin pass:cassandra
openssl pkcs12 -in "$PKS_KEY_STORE" -nodes -nocerts -out "${CLUSTER_NAME}_CLIENT.key.pem" -passin pass:cassandra
The keytool utility ships with the Java JDK. We use this keytool command to create the cluster key. Let’s break down the script that generates the keys and certificates.
First, we create a key store and add to that keystore our new cluster key. The keystore will contain all of the details about our key, and we can generate public keys, certificates, etc. from it.
Once we create the Cassandra cluster key, we will want to export a public key from it. The public key can be used to identify and validate node members.
Export a Public Key for the Cassandra Cluster Key
# Create the public key for the cluster which is used to identify nodes.
keytool -export -alias "${CLUSTER_NAME}_CLUSTER" -file "$CLUSTER_PUBLIC_CERT" -keystore "$KEY_STORE" \
-storepass "$PASSWORD" -keypass "$PASSWORD" -noprompt
Then we will import the public key into the Cassandra trust store so that nodes can identify each other. The provision script will copy the Cassandra keystore and truststore to the various nodes. If we wanted to deploy additional keys, we could use a tool like Ansible or scp (secure copy) to add the keys to truststores on various nodes (we cover Ansible in detail in other tutorials).
Import Public Key for the Cassandra Cluster Key Into the Truststore so Cassandra Nodes Can Identify Each Other
# Import the identity of the cluster public cluster key into the trust store so that nodes can identify each other.
keytool -import -v -trustcacerts -alias "${CLUSTER_NAME}_CLUSTER" -file "$CLUSTER_PUBLIC_CERT" -keystore "$TRUST_STORE" \
-storepass "$PASSWORD" -keypass "$PASSWORD" -noprompt
We perform the same three tasks for the client keys (create a key, export a public key, and add the public key to the truststore). Then lastly, we create PEM files for the Cassandra client keys by exporting our Java JKS keystore as a PKCS12 keystore.
The Cassandra truststore is used to identify nodes and clients the Cassandra cluster nodes trust. You don't have to use a truststore with clients, you could use username and password (or both).
Next, we want to create PEM files to use with cqlsh (the Cassandra CQL client). The Java keystore uses the JKS format, which is specific to Java. In order to convert our keys to the PEM format (a more widely used format), we first convert our JKS-formatted keystore to a PKCS12-formatted keystore (PKCS12 is a standard keystore format). Then we use openssl to extract the private and public keys from the PKCS12 keystore.
As part of the provisioning script for cassandra_image, we added the following to the Cassandra image project so the Cassandra install works with these SSL keys:
scripts/040-install-certs.sh - Install Certs into Cassandra
#!/bin/bash
set -e
DESTINATION_DIRECTORY=/opt/cassandra/conf/certs
SOURCE_DIRECTORY="$HOME/resources$DESTINATION_DIRECTORY"   # use $HOME rather than ~, which does not expand inside quotes
if [ -d "$SOURCE_DIRECTORY" ]; then
mkdir -p "$DESTINATION_DIRECTORY"
cp -r "$SOURCE_DIRECTORY" "$DESTINATION_DIRECTORY"
fi
if [ ! -d "$SOURCE_DIRECTORY" ]; then
echo "UNABLE TO INSTALL CERTS AS THEY WERE NOT FOUND"
fi
This will copy the certs to the right location if you generated a folder in resources (cassandra_image/resources/opt/cassandra/conf/certs), which the last script that we covered does.
Configure Cassandra to use keys in Cassandra Config
/opt/cassandra/conf/cassandra.yaml
server_encryption_options:
internode_encryption: all
keystore: /opt/cassandra/conf/certs/cassandra.keystore
keystore_password: cassandra
truststore: /opt/cassandra/conf/certs/cassandra.truststore
truststore_password: cassandra
# More advanced defaults below:
protocol: TLS
client_encryption_options:
enabled: true
# If enabled and optional is set to true encrypted and unencrypted connections are handled.
optional: false
keystore: /opt/cassandra/conf/certs/cassandra.keystore
keystore_password: cassandra
truststore: /opt/cassandra/conf/certs/cassandra.truststore
truststore_password: cassandra
require_client_auth: true
protocol: TLS
Now let’s test it. We can log into one of our nodes and use nodetool to describe the cluster. If it is successful, we will see all three nodes.
Testing That our Cassandra Nodes Can Talk to Each Other
$ vagrant up
# Get a coffee and otherwise relax for a minute.
# Now log into one of the nodes.
$ vagrant ssh node0
# Now check to see if the cluster is formed.
[vagrant@localhost ~]$ /opt/cassandra/bin/nodetool describecluster
Cluster Information:
Name: test
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
86afa796-d883-3932-aa73-6b017cef0d19: [192.168.50.4, 192.168.50.5, 192.168.50.6]
We now use these SSL keys with a local cluster, but later tutorials in this series will cover using these same keys with AWS instances running in the AWS cloud. We will use Packer to create AMIs and CloudFormation to create cluster infrastructure like VPCs, subnets, etc.
Set Up the Cassandra cqlsh Client
This part is done on a MacBook Pro running macOS (the client machine; it could also be Linux or Windows). In this example, we have the virtual machines running CentOS 7 with Vagrant on VirtualBox. We can connect to those instances with Cassandra cqlsh.
To connect with cqlsh, we will need to set up our keys on the client machine.
Let’s copy cert files so we can access them from the client (MacBook pro / OSX).
Copy Cert Files Created Earlier to install for Cassandra Client
$ cd ~/github/cassandra-image/resources/opt/cassandra/conf/certs
$ mkdir /opt/cassandra/conf/certs
$ cp * /opt/cassandra/conf/certs
Now we will create a cqlshrc, which is a Cassandra config file that dictates how the client (cqlsh) connects to Cassandra.
First, we create the cqlshrc file in ~/.cassandra.
Cassandra Config for client: Create the cqlshrc in ~/.cassandra
$ mkdir ~/.cassandra
$ cd ~/.cassandra
$ touch cqlshrc
# edit this file
Next, we edit the Cassandra client config file and add the following.
~/.cassandra/cqlshrc Contents for Cassandra Configuration for client
[connection]
hostname = 192.168.50.4
port = 9042
factory = cqlshlib.ssl.ssl_transport_factory
[ssl]
certfile = /opt/cassandra/conf/certs/test_CLIENT.cer.pem
validate = false
# Next 2 lines must be provided when require_client_auth = true in the cassandra.yaml file
userkey = /opt/cassandra/conf/certs/test_CLIENT.key.pem
usercert = /opt/cassandra/conf/certs/test_CLIENT.cer.pem
Note that we specify the Cassandra nodes, and we are using the PEM file as our credentials via SSL to prove who we are (a valid client) instead of a username/password. (We could use both username/password and usercert/userkey.)
We need the userkey and usercert in the cqlshrc because we set require_client_auth = true in the cassandra.yaml file for the cluster nodes.
Now let’s test that the client connection works with SSL via cqlsh.
Testing Cassandra Client Connection Using cqlsh
$ /opt/cassandra/bin/cqlsh --ssl
Connected to test at 192.168.50.4:9042.
[cqlsh 5.0.1 | Cassandra 3.9 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
We set up keys for the client and the cluster. We deployed keys to three Linux boxes using Vagrant provisioning. We then set up cqlsh to use SSL. We logged into one of the nodes and checked that the cluster was formed with nodetool describecluster. Then we set up cqlsh locally to connect to the cluster using SSL.
More to Come from Cassandra Cluster Tutorial series
Airflow Breeze is an easy-to-use development and test environment using Docker Compose. The environment is available for local use and is also used in Airflow's CI tests.
We call it Airflow Breeze as it's a Breeze to contribute to Airflow.
The advantages and disadvantages of using the Breeze environment vs. other ways of testing Airflow are described in CONTRIBUTING.rst.
Note
We are currently migrating the old Bash-based ./breeze-legacy to the Python-based breeze. Some of the commands have already been converted to breeze, but for some old commands you still need to use ./breeze-legacy. The documentation mentions when ./breeze-legacy is involved.
After installing, the new breeze is available on your PATH and you should launch it simply as breeze <COMMAND> <FLAGS>. Previously you had to prepend breeze with ./, but this is not needed any more. For convenience, we will keep the ./breeze script for a while to run the new breeze, and you can still use the legacy Breeze with ./breeze-legacy.
Watch the video below about Airflow Breeze. It explains the motivation for Breeze and screencasts all its uses. The video describes the old ./breeze-legacy (in the video it is still called ./breeze).
Version: Install the latest stable Docker Desktop and make sure it is in your PATH. Breeze detects if you are using a version that is too old and warns you to upgrade.
Permissions: Configure to run the docker commands directly and not only via root user. Your user should be in the docker group. See Docker installation guide for details.
Disk space: On macOS, increase your available disk space before starting to work with the environment. At least 20 GB of free disk space is recommended. You can also get by with a smaller space but make sure to clean up the Docker disk space periodically. See also Docker for Mac - Space for details on increasing disk space available for Docker on Mac.
Docker problems: Sometimes it is not obvious that space is the issue when you run into a problem with Docker. If you see weird behaviour, try the breeze cleanup command. Also see the pruning instructions from Docker.
Here is an example configuration with more than 200GB disk space for Docker:
Version: Install the latest stable Docker Compose and add it to the PATH. Breeze detects if you are using version that is too old and warns you to upgrade.
Permissions: Configure permission to be able to run the docker-compose command by your user.
Accessing the host Windows filesystem incurs a performance penalty; it is therefore recommended to do development on the Linux filesystem. For example, run cd ~, create a development folder in your Linux distro home, and git pull the Airflow repo there.
WSL 2 Docker mount errors:
Another reason to use the Linux filesystem is that sometimes, depending on the length of your path, you might get strange errors when you try to start Breeze, such as caused: mount through procfd: not a directory: unknown:. Therefore, checking out Airflow on a Windows-mounted filesystem is strongly discouraged.
WSL 2 Docker volume remount errors:
If you're experiencing errors such as ERROR: for docker-compose_airflow_run Cannot create container for service airflow: not a directory when starting Breeze after the first time or an error like docker: Error response from daemon: not a directory. See 'docker run --help'. when running the pre-commit tests, you may need to consider installing Docker directly in WSL 2 instead of using Docker Desktop for Windows.
WSL 2 Memory Usage :
WSL 2 can consume a lot of memory under the process name "Vmmem". To reclaim the memory after development you can:
On the Linux distro clear cached memory: sudo sysctl -w vm.drop_caches=3
If you are no longer using Docker, you can quit Docker Desktop (right-click the system tray icon and select "Quit Docker Desktop")
If no longer using WSL you can shut it down on the Windows Host with the following command: wsl --shutdown
Developing in WSL 2:
You can use all the standard Linux command line utilities to develop on WSL 2. Furthermore, VS Code supports developing on Windows while executing remotely in WSL. If VS Code is installed on the Windows host system, then in the WSL Linux distro you can run code . in the root directory of your Airflow repo to launch VS Code.
We are using the pipx tool to install and manage Breeze. The pipx tool is created by the creators of pip from the Python Packaging Authority.
Install pipx
pip install --user pipx
Breeze is not globally accessible until your PATH is updated. Add <USER FOLDER>/.local/bin to your PATH environment variable. This can be done automatically by the following command (follow the instructions printed).
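The command itself is not shown in this copy; with pipx this is normally done via pipx ensurepath:

pipx ensurepath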
Minimum 4GB RAM for Docker Engine is required to run the full Breeze environment.
On macOS, 2GB of RAM are available for your Docker containers by default, but more memory is recommended (4GB should be comfortable). For details see Docker for Mac - Advanced tab.
On Windows, expect the WSL 2 Linux distro and Docker containers to use 7-8 GB of RAM.
Minimum 40GB free disk space is required for your Docker Containers.
On macOS, this might deteriorate over time so you might need to increase it or run breeze cleanup periodically. For details see Docker for Mac - Advanced tab.
You may need to clean up your Docker environment occasionally. The images are quite big (1.5GB for both images needed for static code analysis and CI tests) and, if you often rebuild/update them, you may end up with some unused image data.
To clean up the Docker environment:
Stop Breeze with breeze stop. (If Breeze is already running)
Run the breeze cleanup command.
Run docker images --all and docker ps --all to verify that your Docker is clean.
Both commands should return an empty list of images and containers respectively.
If you run into disk space errors, consider pruning your Docker images with the docker system prune --all command. You may need to restart the Docker Engine before running this command.
In case of disk space errors on macOS, increase the disk space available for Docker. See Prerequisites for details.
Run this command to install Breeze (make sure to use -e flag):
pipx install -e ./dev/breeze
Once this is complete, you should have breeze binary on your PATH and available to run by breeze command.
Those are all available commands for Breeze and details about the commands are described below:
Breeze installed this way is linked to your checked-out sources of Airflow, so Breeze will automatically use the latest version of the sources from ./dev/breeze. Sometimes, when dependencies are updated, breeze commands will offer you to self-upgrade (you just need to answer y when asked).
You can always run such self-upgrade at any time:
breeze self-upgrade
Those are all available flags of self-upgrade command:
If you have several checked out Airflow sources, Breeze will warn you if you are using it from a different source tree and will offer you to re-install from those sources - to make sure that you are using the right version.
You can skip Breeze's upgrade check by setting the SKIP_BREEZE_UPGRADE_CHECK variable to a non-empty value.
By default Breeze works on the version of Airflow that you run it in - in case you are outside of the sources of Airflow and you installed Breeze from a directory - Breeze will be run on Airflow sources from where it was installed.
You can run the breeze version command to see where breeze is installed from and which sources Breeze currently works on.
Those are all available flags of version command:
The first time you run Breeze, it pulls and builds a local version of Docker images. It pulls the latest Airflow CI images from the GitHub Container Registry and uses them to build your local Docker images. Note that the first run (per Python version) might take up to 10 minutes on a fast connection to start. Subsequent runs should be much faster.
Once you enter the environment, you are dropped into bash shell of the Airflow container and you can run tests immediately.
To use the full potential of breeze you should set up autocomplete. The breeze command comes with a built-in bash/zsh/fish autocomplete setup command. After installing, when you start typing the command, you can use <TAB> to show all the available switches and get auto-completion on typical values of parameters that you can use.
You should set up the autocomplete option automatically by running:
breeze setup-autocomplete
You get the auto-completion working when you re-enter the shell (follow the instructions printed). The command will warn you and not reinstall autocomplete if you have already done so, but you can also force reinstalling the autocomplete via:
breeze setup-autocomplete --force
Those are all available flags of setup-autocomplete command:
When you enter the Breeze environment, an environment file is automatically sourced from files/airflow-breeze-config/variables.env.
You can also add files/airflow-breeze-config/init.sh, and the script will always be sourced when you enter Breeze. For example, you can add pip install commands if you want to install custom dependencies; there is no limit to the customizations you can add.
The files folder from your local sources is automatically mounted to the container under /files path and you can put there any files you want to make available for the Breeze container.
You can also copy any .whl or sdist packages to dist, and when you pass the --use-packages-from-dist command-line flag, breeze will automatically install the wheel or sdist packages found there when you enter Breeze.
You can also add your local tmux configuration in files/airflow-breeze-config/.tmux.conf and these configurations will be available for your tmux environment.
There is a symlink between files/airflow-breeze-config/.tmux.conf and ~/.tmux.conf in the container, so you can change it at any place, and run
tmux source ~/.tmux.conf
inside container, to enable modified tmux configurations.
Breeze helps with running tests in the same environment/way as CI tests are run. You can run various types of tests while you enter Breeze CI interactive environment - this is described in detail in TESTING.rst
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command and it is not yet available in the new breeze command):
You can use additional breeze flags to choose your environment. You can specify a Python version to use, and backend (the meta-data database). Thanks to that, with Breeze, you can recreate the same environments as we have in matrix builds in the CI.
For example, you can choose to run Python 3.7 tests with MySQL as backend and with mysql version 8 as follows:
breeze --python 3.7 --backend mysql --mysql-version 8
The choices you make are persisted in the ./.build/ cache directory so that the next time you use the breeze script, it can use the values that were used previously. This way you do not have to specify them when you run the script. You can delete the .build/ directory if you want to restore the default settings.
Parameters whose values can be stored persistently in the cache are marked with >VALUE< in the help of the commands.
Another part of the configuration is enabling/disabling the cheatsheet and ASCII art. Both can be disabled: they are nice looking, and the cheatsheet contains useful information for first-time users, but eventually you might want to disable both if you find them repetitive and annoying.
The config command also controls colour-blind-friendly communication for Breeze messages. By default we communicate information/errors/warnings/successes via colour-coded messages, but you can switch this off by passing --no-colour to config, in which case the messages printed by Breeze use different formatting (italic/bold/underline) to indicate different kinds of messages rather than colours.
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
Those are all available flags of config command:
You can also dump a hash of the configuration options used; this is mostly used to regenerate the help dumps of the commands only when they change.
This documentation contains exported images with "help" of their commands and parameters. You can regenerate all those images (which might be needed in case new version of rich is used) via regenerate-command-images command.
The Airflow webserver needs to prepare www assets, compiled with node and yarn. The compile-www-assets command takes care of this. It is needed when you want to run the webserver inside Breeze.
For testing Airflow you often want to start multiple components (in multiple terminals). Breeze has a built-in start-airflow command that starts the Breeze container, launches multiple terminals using tmux, and launches all necessary Airflow components in those terminals.
When you are starting Airflow from local sources, www asset compilation is automatically executed beforehand.
breeze --python 3.7 --backend mysql start-airflow
You can also use it to start any released version of Airflow from PyPI with the --use-airflow-version flag.
breeze --python 3.7 --backend mysql --use-airflow-version 2.2.5 start-airflow
Those are all available flags of start-airflow command:
If you are having problems with the Breeze environment, try the steps below. After each step you can check whether your problem is fixed.
If you are on macOS, check if you have enough disk space for Docker (Breeze will warn you if not).
Stop Breeze with breeze stop.
Delete the .build directory and run breeze build-image.
Clean up Docker images via breeze cleanup command.
Restart your Docker Engine and try again.
Restart your machine and try again.
Re-install Docker Desktop and try again.
In case the problems are not solved, you can set the VERBOSE_COMMANDS variable to "true":
export VERBOSE_COMMANDS="true"
Then run the failed command, copy-and-paste the output from your terminal to the Airflow Slack #airflow-breeze channel and describe your problem.
Airflow Breeze is a bash script serving as a "swiss-army-knife" of Airflow testing. Under the hood it uses other scripts that you can also run manually if you have problems running the Breeze environment. The Breeze script allows performing the following tasks:
Those are commands mostly used by contributors:
Execute arbitrary command in the test environment with breeze shell command
Enter interactive shell in CI container when shell (or no command) is specified
Start containerised, development-friendly airflow installation with breeze start-airflow command
Compile www assets for webserver breeze compile-www-assets command
Build documentation with breeze build-docs command
Initialize local virtualenv with ./scripts/tools/initialize_virtualenv.py command
Run static checks with autocomplete support breeze static-checks command
Run test specified with breeze tests command
Run docker-compose tests with breeze docker-compose-tests command.
Build CI docker image with breeze build-image command
Cleanup breeze with breeze cleanup command
Additional management tasks:
Join running interactive shell with breeze exec command
Stop running interactive environment with breeze stop command
Execute arbitrary docker-compose command with ./breeze-legacy docker-compose command
You can simply enter the breeze container and run pytest command there. You can enter the container via just breeze command or breeze shell command (the latter has more options useful when you run integration or system tests). This is the best way if you want to interactively run selected tests and iterate with the tests. Once you enter breeze environment it is ready out-of-the-box to run your tests by running the right pytest command (autocomplete should help you with autocompleting test name if you start typing pytest tests<TAB>).
You can re-run the tests interactively, add extra parameters to pytest, and modify the files before re-running the test to iterate over the tests. You can also add more flags when starting the breeze shell command when you run integration tests or system tests. Read more details in TESTING.rst, where all of our test types are explained along with more information on how to run them.
You can also run tests via the built-in breeze tests command. Like the iterative pytest command, it allows you to run tests individually, by class, or in any other way pytest allows, but it also allows you to run the tests in the same test "types" that are used to run the tests in CI: Core, Always, API, Providers. This is how our CI runs them, running each group in parallel to the other groups, and you can replicate this behaviour.
Another interesting use of the breeze tests command is that you can easily specify a subset of the tests for Providers. For example, breeze tests --test-type "Providers[airbyte,http]" will only run tests for the airbyte and http providers.
Here is the detailed set of options for the breeze tests command.
The image building is usually run for users automatically when needed, but sometimes Breeze users might want to manually build, pull or verify the CI images.
Build CI docker image with breeze build-image command
Pull CI images in parallel breeze pull-image command
Users can also build Production images when they are developing them. However, when you want to use the PROD image, the regular docker build commands are recommended. See building the image.
Build PROD image with breeze build-prod-image command
Pull PROD image in parallel breeze pull-prod-image command
Maintainers also can use Breeze for other purposes (those are commands that regular contributors likely do not need or have no access to run). Those are usually connected with releasing Airflow:
Prepare cache for CI: breeze build-image --prepare-build-cache and breeze build-prod-image --prepare-build-cache (needs the buildx plugin and write access to the ghcr.io registry)
Generate constraints with breeze generate-constraints (needed when conflicting changes are merged)
Verify providers: breeze verify-provider-packages (when releasing provider packages) - including importing the providers in an earlier airflow version.
Breeze keeps data for all its integrations in named docker volumes. Each backend and integration keeps data in its own volume. Those volumes are persisted until the breeze stop command is run. You can also preserve the volumes by adding the flag --preserve-volumes when you run the command. Then, the next time you start Breeze, it will have the data pre-populated.
Breeze uses docker images heavily and those images are rebuilt periodically. This might cause extra disk usage by the images. If you need to clean up the images periodically you can run the breeze cleanup command (by default it will skip removing your images before cleaning up, but you can also remove the images to clean up everything by adding --all).
Often if you want to run full airflow in the Breeze environment you need to launch multiple terminals and run airflow webserver, airflow scheduler, airflow worker in separate terminals.
This can be achieved either via tmux or via exec-ing into the running container from the host. Tmux is installed inside the container and you can launch it with tmux command. Tmux provides you with the capability of creating multiple virtual terminals and multiplex between them. More about tmux can be found at tmux GitHub wiki page . Tmux has several useful shortcuts that allow you to split the terminals, open new tabs etc - it's pretty useful to learn it.
Here is the part of Breeze video which is relevant:
Another way is to exec into Breeze terminal from the host's terminal. Often you can have multiple terminals in the host (Linux/MacOS/WSL2 on Windows) and you can simply use those terminals to enter the running container. It's as easy as launching breeze exec while you already started the Breeze environment. You will be dropped into bash and environment variables will be read in the same way as when you enter the environment. You can do it multiple times and open as many terminals as you need.
Here is the part of Breeze video which is relevant:
To keep the Docker image small, not all tools are pre-installed in it. However, we have made sure that there is an easy process for installing additional tools.
Additional tools are installed in /files/bin. This path is added to $PATH, so your shell will automatically autocomplete files that are in that directory. You can also keep the binaries for your tools in this directory if you need to.
Installation scripts
For development convenience, we have also provided installation scripts for commonly used tools. They are installed to /files/opt/, so they are preserved after restarting the Breeze environment. Each script is also available in $PATH, so just type install_<TAB> to get a list of tools.
When Breeze starts, it can start additional integrations. Those are additional docker containers that are started in the same docker-compose command. Those are required by some of the tests as described in TESTING.rst#airflow-integration-tests.
By default, Breeze starts only the airflow container, without any integrations enabled. If you selected the postgres or mysql backend, the container for the selected backend is also started (but only that one). You can start additional integrations by passing the --integration flag with the appropriate integration name when starting Breeze. You can specify several --integration flags to start more than one integration at a time. Finally, you can specify --integration all to start all integrations.
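For example, the following starts Breeze with two integrations enabled (assuming redis and kerberos are among the available integration names):
breeze --integration redis --integration kerberos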
Once an integration is started, it will continue to run until the environment is stopped with the breeze stop command or restarted via the breeze restart command.
Note that running integrations uses significant resources - CPU and memory.
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
With Breeze, you can build the images that are used by Airflow CI as well as the production images.
For all development tasks, unit tests, integration tests, and static code checks, we use the CI image maintained in GitHub Container Registry.
The CI image is built automatically as needed; however, it can be rebuilt manually with the build-image command. The production image should be built manually - but a variant of this image is also built automatically when Kubernetes tests are executed; see Running Kubernetes tests.
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
Building the image for the first time pulls a pre-built version of the image from Docker Hub, which may take some time. For subsequent source code changes, no wait time is expected. However, changes to sensitive files like setup.py or Dockerfile.ci will trigger a rebuild that may take more time, though it is highly optimized to rebuild only what is needed.
Breeze has a built-in mechanism to check whether your local image has diverged too much from the latest image built on CI. This might happen, for example, when the latest patches have been released as new Python images or when significant changes are made in the Dockerfile. In such cases, Breeze will download the latest images before rebuilding, because this is usually faster than rebuilding the image.
Those are all available flags of build-image command:
You can also pull the CI images locally in parallel with optional verification.
Those are all available flags of pull-image command:
Finally, you can verify CI image by running tests - either with the pulled/built images or with an arbitrary image.
Those are all available flags of verify-image command:
Breeze can also be used to verify whether provider classes are importable and whether they follow the right naming conventions. This happens automatically on CI, but you can also run it manually.
breeze verify-provider-packages
You can also run the verification with an earlier airflow version to check for compatibility.
Breeze can also be used to prepare airflow packages - both "apache-airflow" main package and provider packages.
You can read more about testing provider packages in TESTING.rst
There are several commands that you can run in Breeze to manage and build packages:
preparing Provider documentation files
preparing Airflow packages
preparing Provider packages
Preparing provider documentation files is part of the release procedure performed by the release managers, and it is described in detail in dev.
The example below performs documentation preparation for provider packages.
breeze prepare-provider-documentation
By default, the documentation preparation runs package verification to check if all packages are importable, but you can add --skip-package-verification to skip it.
You can also add --answer yes to perform non-interactive build.
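For example, a non-interactive run that also skips package verification:
breeze prepare-provider-documentation --answer yes --skip-package-verification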
The packages are prepared in the dist folder. Note that this command cleans up the dist folder before running, so you should run it before generating the airflow package below, as previously built packages will be removed.
The example below builds provider packages in the wheel format:
breeze prepare-provider-packages --package-format wheel
If you run this command without specifying any packages, you will prepare all packages; however, you can specify the providers that you would like to build. By default, both types of packages are prepared (wheel and sdist), but you can change this by providing the optional --package-format flag.
breeze prepare-provider-packages google amazon
You can see all providers available by running this command:
breeze prepare-provider-packages --help
You can prepare airflow packages using breeze:
breeze prepare-airflow-package
This prepares airflow .whl package in the dist folder.
Again, you can specify the optional --package-format flag to build selected formats of the airflow packages; the default is to build both types of packages, sdist and wheel.
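For example, to build only the wheel package:
breeze prepare-airflow-package --package-format wheel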
The Production image is also maintained in the GitHub Container Registry for caching, and it is manually pushed to apache/airflow for released versions. This Docker image (built using the official Dockerfile) contains a size-optimised Airflow installation with selected extras and dependencies.
However, in many cases you want to build your own custom version of the image - with added apt dependencies, Python dependencies, or additional Airflow extras. Breeze's build-image command helps you build your own customized variant of the image that contains everything you need.
You can switch to building the production image by using the build-prod-image command. Note that the images can also be built using the docker build command by passing the appropriate build-args as described in IMAGES.rst, but Breeze provides several flags that make it easier. You can see all the flags by running breeze build-prod-image --help, but a typical example is presented here:
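The flag names below are inferred from the description that follows and may differ between Breeze versions:
breeze build-prod-image --additional-dev-apt-deps "libasound2-dev" --additional-runtime-apt-deps "libasound"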
This installs additional apt dependencies - libasound2-dev in the build image and libasound in the final image. Those are development dependencies that might be needed to build and use Python packages added via the --additional-python-deps flag. The dev dependencies are not installed in the final production image; they are only installed in the build "segment" of the production image, which is used as an intermediate step to build the final image. Usually the names of dev dependencies end with the -dev suffix, and they need to be paired with the corresponding runtime dependency added for the runtime image (without -dev).
Those are all available flags of build-prod-image command:
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
You can also pull PROD images in parallel with optional verification.
Those are all available flags of pull-prod-image command:
Finally, you can verify PROD image by running tests - either with the pulled/built images or with an arbitrary image.
Those are all available flags of verify-prod-image command:
The Production image can be released by release managers who have permissions to push the image. This happens only when a release candidate (RC) or final version of Airflow is released.
You release "regular" and "slim" images as separate steps.
By default, when you are releasing the "final" image, it is also tagged with the "latest" tag, but this step can be skipped if you pass the --skip-latest flag.
These are all of the available flags for the release-prod-images command:
You can run static checks via Breeze. You can also run them via the pre-commit command, but with auto-completion Breeze makes it easier to run selective static checks. If you press <TAB> after static-checks and you have auto-complete set up, you should see an auto-completable list of all available checks.
breeze static-checks -t run-mypy
The above will run mypy check for currently staged files.
You can also pass specific pre-commit flags, for example --all-files:
breeze static-checks -t run-mypy --all-files
The above will run mypy check for all files.
There is a convenience --last-commit flag that you can use to run static check on last commit only:
breeze static-checks -t run-mypy --last-commit
The above will run mypy check for all files in the last commit.
There is another convenience --commit-ref flag that you can use to run static checks on a specific commit:
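breeze static-checks -t run-mypy --commit-ref 639483d998ecac64d0fef7c5aa4634414065f690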
The above will run mypy check for all files in the 639483d998ecac64d0fef7c5aa4634414065f690 commit. Any commit-ish reference from Git will work here (branch, tag, short/long hash etc.)
If you ever need to get a list of the files that will be checked (for troubleshooting) use these commands:
breeze static-checks -t identity --verbose # currently staged files
breeze static-checks -t identity --verbose --from-ref $(git merge-base main HEAD) --to-ref HEAD # branch updates
Those are all available flags of static-checks command:
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
Note
When you run static checks, some of the artifacts (the mypy cache) are stored in a docker-compose volume so that static check execution can be sped up significantly. However, sometimes the cache might get broken, in which case you should run breeze stop to clean up the cache.
To build documentation in Breeze, use the build-docs command:
breeze build-docs
Results of the build can be found in the docs/_build folder.
The documentation build consists of three steps:
verifying consistency of indexes
building documentation
spell checking
You can run only one of these stages by providing --spellcheck-only or --docs-only after the extra -- flag.
breeze build-docs --spellcheck-only
This process can take some time, so in order to make it shorter you can filter by package using the --package-filter <PACKAGE-NAME> flag. The package name has to be one of the providers or apache-airflow. For instance, for the Amazon provider, the command would look like this (assuming the provider documentation package is named apache-airflow-providers-amazon):
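breeze build-docs --package-filter apache-airflow-providers-amazon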
Often, errors during documentation generation come from the docstrings of auto-api generated classes. During the docs build, the auto-api generated files are stored in the docs/_api folder. This helps you easily identify the location where the documentation problems originated.
Those are all available flags of build-docs command:
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
Whenever setup.py gets modified, the CI main job will re-generate the constraint files. Those constraint files are stored in separate orphan branches: constraints-main, constraints-2-0.
You can use the breeze generate-constraints command to manually generate constraints for all or selected Python versions and a single constraints mode.
Warning
In order to generate constraints, you need to build all images with --upgrade-to-newer-dependencies flag - for all python versions.
Constraints are generated separately for each python version and there are separate constraints modes:
"constraints" - those are constraints generated by matching the current airflow version from sources and providers that are installed from PyPI. Those are constraints used by the users who want to install airflow with pip.
"constraints-source-providers" - those are constraints generated by using providers installed from current sources. While adding new providers their dependencies might change, so this set of providers is the current set of the constraints for airflow and providers from the current main sources. Those providers are used by CI system to keep "stable" set of constraints.
"constraints-no-providers" - those are constraints generated from only Apache Airflow, without any providers. If you want to manage airflow separately and then add providers individually, you can use those.
Those are all available flags of generate-constraints command:
In case someone modifies setup.py, the scheduled CI tests automatically upgrade and push changes to the constraint files. However, you can also perform a test run of this locally using the procedure described in Refreshing CI Cache, which utilises multiple processors on your local machine to generate the constraints faster.
This bumps the constraint files to the latest versions and stores the hash of setup.py. The generated constraints and setup.py hash files are stored in the files folder, and while the constraints are generated, a diff of changes vs. the previous constraint files is printed.
You can set up your host IDE (for example, IntelliJ's PyCharm/Idea) to work with Breeze and benefit from all the features provided by your IDE, such as local and remote debugging, language auto-completion, documentation support, etc.
Use the right command to activate the virtualenv (workon if you use virtualenvwrapper, or pyenv activate if you use pyenv).
Initialize the created local virtualenv:
./scripts/tools/initialize_virtualenv.py
Warning
Make sure that you use the right Python version in this command - matching the Python version you have in your local virtualenv. If you don't, you will get strange conflicts.
Select the virtualenv you created as the project's default virtualenv in your IDE.
Note that you can also use the local virtualenv for Airflow development without Breeze. This is a lightweight solution that has its own limitations.
More details on using the local virtualenv are available in the LOCAL_VIRTUALENV.rst.
Here is the part of the Breeze video which is relevant (note that it refers to the old ./breeze-legacy; this is not available in the current breeze command):
You can use Breeze to run docker-compose tests. Those tests are run using the Production image, and they test it against the Quick-start docker-compose setup we have.
Breeze helps with running Kubernetes tests in the same environment/way as they are run in CI. Breeze helps to set up a KinD cluster for testing, sets up a virtualenv, and automatically downloads the right tools to run the tests.
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command and it is not yet available in the current breeze command):
After starting up, the environment runs in the background and takes precious memory. You can always stop it via:
breeze stop
Those are all available flags of stop command:
Here is the part of Breeze video which is relevant (note that it refers to the old ./breeze-legacy command but it is very similar to current breeze command):
Breeze requires certain resources to be available - disk, memory, CPU. When you enter Breeze's shell, the resources are checked, and information about whether there are enough resources is displayed. However, you can manually run a resource check at any time with the breeze resource-check command.
Those are all available flags of resource-check command:
When our CI runs a job, it needs all the memory and disk it can get. We have a Breeze command that frees the memory and disk space used. You can also use it to clear space locally, but it performs a few operations that might be a bit invasive - such as removing the swap file and completely pruning the Docker disk space used.
Those are all available flags of free-space command:
When our CI runs a job, it needs to decide which tests to run, whether to build images, and to what extent the tests should be run on multiple combinations of Python, Kubernetes, and backend versions, in order to optimize the time needed to run the CI builds. You can also use this tool to check which tests will be run when you provide a specific commit that Breeze should run the tests on.
More details about the algorithm used to pick the right tests can be found in Selective Checks.
Those are all available flags of selective-check command:
When our CI runs a job, we automatically upgrade our dependencies in the main build. However, this might lead to conflicts and pip backtracking for a long time (possibly forever) during dependency resolution. Unfortunately, those issues are difficult to diagnose, so we had to invent our own tool to help us diagnose them. This tool is find-newer-dependencies, and it helps to guess which new dependency might have caused the backtracking. The whole process is described in tracking backtracking issues.
Those are all available flags of find-newer-dependencies command:
When you are in the CI container, the following directories are used:
/opt/airflow - Contains sources of Airflow mounted from the host (AIRFLOW_SOURCES).
/root/airflow - Contains all the "dynamic" Airflow files (AIRFLOW_HOME), such as:
airflow.db - sqlite database in case sqlite is used;
dags - folder with non-test dags (test dags are in /opt/airflow/tests/dags);
logs - logs from Airflow executions;
unittest.cfg - unit test configuration generated when entering the environment;
webserver_config.py - webserver configuration generated when running Airflow in the container.
Note that when running in your local environment, the /root/airflow/logs folder is actually mounted from your logs directory in the Airflow sources, so all logs created in the container are automatically visible in the host as well. Every time you enter the container, the logs directory is cleaned so that logs do not accumulate.
When you are in the production container, the following directories are used:
/opt/airflow - Contains sources of Airflow mounted from the host (AIRFLOW_SOURCES).
/root/airflow - Contains all the "dynamic" Airflow files (AIRFLOW_HOME), such as:
airflow.db - sqlite database in case sqlite is used;
dags - folder with non-test dags (test dags are in /opt/airflow/tests/dags);
logs - logs from Airflow executions;
unittest.cfg - unit test configuration generated when entering the environment;
webserver_config.py - webserver configuration generated when running Airflow in the container.
Note that when running in your local environment, the /root/airflow/logs folder is actually mounted from your logs directory in the Airflow sources, so all logs created in the container are automatically visible in the host as well. Every time you enter the container, the logs directory is cleaned so that logs do not accumulate.
To run Docker Compose commands (such as help, pull, etc.), use the docker-compose command. To add extra arguments, specify them after --.
Sometimes during the build, you are asked whether to perform an action, skip it, or quit. This happens when rebuilding or removing an image and in a few other cases - for actions that take a lot of time or could be potentially destructive. You can force the answer to these questions by providing the --answer flag in the commands that support it.
For automation scripts, you can export the ANSWER variable (and set it to y, n, q, yes, no, quit - in all case combinations).
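For example, in an automation script:
export ANSWER="yes"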
On Linux, there is a problem with propagating ownership of created files (a known Docker problem). The files and directories created in the container are not owned by the host user (but by the root user in our case). This may prevent you from switching branches, for example, if files owned by the root user are created within your sources. In case you are on a Linux host and have some files in your sources created by the root user, you can fix the ownership of those files by running:
breeze fix-ownership
Those are all available flags of fix-ownership command:
Important sources of Airflow are mounted inside the airflow container that you enter. This means that you can continue editing your changes on the host in your favourite IDE and have them visible in the Docker container immediately, ready to test without rebuilding images. You can disable mounting by specifying the --skip-mounting-local-sources flag when running Breeze. In this case, you will have the sources embedded in the container, and changes to these sources will not be persistent.
After you run Breeze for the first time, you will have an empty files directory in your source tree, which is mapped to /files in your Docker container. You can put there any files you need to configure and run inside Docker. They will not be removed between Docker runs.
By default, the /files/dags folder is mounted from your local <AIRFLOW_SOURCES>/files/dags, and this is the directory used by the airflow scheduler and webserver to scan for DAGs. You can use it to test your DAGs from local sources in Airflow; this is the place to put local DAGs that you wish to run with Breeze.
If you do not use the start-airflow command, you can start the webserver manually with the airflow webserver command if you want to run it. You can use tmux to multiplex terminals. You may need to create a user prior to running the webserver in order to log in. This can be done with the following command (the credential values below are just example placeholders):
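airflow users create --role Admin --username admin --password admin --firstname Admin --lastname User --email admin@example.com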
For databases, you need to run airflow db reset at least once (or run some tests) after you start Airflow Breeze to get the database/tables created. You can connect to the databases with an IDE or any other database client:
You can change the used host port numbers by setting appropriate environment variables:
SSH_PORT
WEBSERVER_HOST_PORT
POSTGRES_HOST_PORT
MYSQL_HOST_PORT
MSSQL_HOST_PORT
FLOWER_HOST_PORT
REDIS_HOST_PORT
If you set these variables, the next time you enter the environment the new ports should be in effect.
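For example, to expose the webserver on host port 18080 (an arbitrary example value):
export WEBSERVER_HOST_PORT=18080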
If you need to change apt dependencies, do it in the Dockerfile.ci; add Python packages in setup.py for airflow and in provider.yaml for provider packages. If you add any "node" dependencies in airflow/www or airflow/ui, you need to compile them on the host with the breeze compile-www-assets command.
You can add dependencies to Dockerfile.ci or setup.py. After you exit the container and re-run breeze, Breeze detects changes in dependencies, asks you to confirm rebuilding the image, and proceeds with rebuilding if you confirm (or skips it if you do not confirm). After rebuilding is done, Breeze drops you into the shell. You may also use the build-image command to only rebuild the CI image and not go into the shell.
During development, changing apt-get dependencies closer to the top of the Dockerfile.ci invalidates the cache for most of the image, and it takes a long time for Breeze to rebuild it. So, it is recommended practice to initially add new dependencies closer to the end of the Dockerfile.ci. This way, dependencies will be added incrementally.
Before merge, these dependencies should be moved to the appropriate apt-get install command, which is already in the Dockerfile.ci.
Breeze uses the built-in capability of rich to record and print the command help as an svg file. It's enabled by setting RECORD_BREEZE_OUTPUT_FILE to the file name where it will be recorded. By default, it records the screenshots with the default character width and the "Breeze screenshot" title, but you can override these with the RECORD_BREEZE_WIDTH and RECORD_BREEZE_TITLE variables respectively.
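For example, a hypothetical invocation that records the help of the main breeze command (the file name and title are just example values):
RECORD_BREEZE_OUTPUT_FILE="breeze-help.svg" RECORD_BREEZE_TITLE="Breeze commands" breeze --help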
Breeze was installed with pipx. With pipx list, you can list the installed packages. Once you have the name of the breeze package, you can proceed to uninstall it.
pipx list
pipx uninstall apache-airflow-breeze
This will also remove the breeze entry point from the folder ${HOME}/.local/bin/.
In Cassandra Lunch #103, we discuss the UML architecture of a Cassandra cluster and the Azure ecosystem's new tool, the Digital Twin Explorer. You can download the files used in the Digital Twin Domain Explorer demo from our GitHub.
UML, the Unified Modeling Language, is one of the primary languages used in the process of architecture definition. It can be used to define the logical, functional, and physical components and structure of a system.
Three Types of Architecture
Industry best practices for architecture typically split architecture into three types.
Functional Architecture
Functional architecture defines relationships between concepts in an enterprise. A well-managed functional architecture will define the system’s logical behavior, but won’t define the physical locations responsible for hosting the classes and services defined in the functional diagram.
Functional architecture includes UML Class, Sequence, and State Diagrams.
One outcome of a functional architecture is the definition of the databases and data architecture that need to be managed for your system.
Generic Physical Architecture
Generic physical architecture defines the locations and groupings of the services and places them on physical devices. It doesn’t select specific service providers for each service.
UML diagrams associated with generic physical architecture include Component, Package, and Deployment Diagrams.
One outcome of a generic physical architecture is a deployment agnostic architecture that can be instantiated on any cloud provider.
Specific Physical Architecture
Specific physical architecture specifies the nodes and services defined in the generic physical architecture.
The UML Diagrams associated with specific physical architecture are Deployment and Node diagrams.
The outcome of a specific physical architecture is a system definition with service definition selections that can be used to generate sprint plans and fully realized feature backlogs.
Three Main Relationships in a Class Diagram
There are three relationships that define the most important connections between elements in the UML Class Diagram. These relationships clarify how the parts of the diagram relate to one another.
Association - the most generic relationship in a Class diagram, connecting two classes; it translates from English as "has a" and is drawn with a solid black triangle.
Composition - a relationship indicating that one class is a component of another class; it translates as "composes" / "is composed of" and is drawn with a solid black diamond.
Inheritance - a relationship indicating that one class is a subclass of another, implying that the subclass inherits the properties and operations of the parent class; it translates as "is a" and is drawn with an open white triangle.
Functional Architecture Diagram for Cassandra Cluster Architecture: Class Diagram
Architecture Class Diagram of Cassandra Cluster
Some of the relationships defined in this class diagram include:
A Cluster is composed of Data Centers
A Data Center is composed of Racks
A Node has Software and Data
A Data Center is a Physical Building
Generic Physical Architecture Diagrams for Cassandra Cluster Architecture
Below is an example of a node diagram that crosses the boundary between generic and specific physical architecture. Node diagrams specify structure in the system – the rectangular boxes are physical or logical machines or groups of machines. Because Airflow, Cassandra, and Kubernetes are specified, this diagram crosses into specific physical architecture. The most important information communicated by node diagrams like this one is the location of specific services within containers.
Node Diagram for Cassandra Data Processing and Monitoring Cluster
Specific Physical Architecture of a Cassandra Cluster
The diagram below is one example of a Cassandra cluster. Notice that the cluster shown below has specific services such as Kafka, Spark, Cassandra, and Akka selected. A more thorough specific physical architecture would include more information about how the nodes are connected, including information about ports and service connections.
Digital Twins, What and Why?
A digital twin is a simulated system. Ultimately, a digital twin can be simple or complex; however, to be effective, it needs to capture information about both your system and the environment that surrounds it. The system is embedded in the environment, and the twin can be used to estimate measures of effectiveness and performance.
Digital twins place the system defined by the architecture into scenarios specified by the creator of the digital twin. This is vastly more flexible than a series of stress tests, as scenario definitions can provide insight into the performance of the system over time and in unusual situations.
The most mature form of a digital twin will pull values from sensors in the real world. You can then use Azure’s models, CLI, and queries to observe and manage the system. You can also add scenarios to project system performance over the entire life-cycle of a system.
1.0 – Access your account on the Azure Ecosystem
You’ll need to register and certify your identity on the Azure Ecosystem. How to do this is beyond the scope of this post.
1.1 – Create a Resource Group
Create Resource Group
1.2 – Create an Azure Digital Twin Resource Instance
Azure Digital Twin Service
Note that you can accomplish steps 1.1 and 1.2 in a single step through the "Create New" option shown.
Azure Digital Twin Resource Options
Fill out the required fields. Notice that you can either create a new resource group or identify the one you created in step 1.1.
Be sure to check the “Assign Azure Digital Twin Data Owner Role” box.
1.3 – Open the Azure Digital Twins Explorer
Open the Explorer
1.4 – Upload your files to Generate the Model
Upload Instruction- Model
You can upload an entire folder of models using the option in the red box, or a single file using the option immediately to the left of the box.
Once you have added files to the model, you'll see that the classes appear in the model without specific relationships. That's because the relationships aren't explicitly defined in this demo until the scenario is uploaded in step 1.5.
1.5 – Upload the Scenario to instantiate it by giving values to the properties you’ve defined.
Twin View and Scenario Upload
The red box indicates where you can upload the scenario.
Once you’ve uploaded the scenario you will see the specific relationships in the model view.
After the scenario is uploaded, you can run queries on the data properties and relationships.
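For example, the default query in the explorer's query pane returns every twin in the instance, and you can filter on a property you defined in your models (the property name below is just an illustration):
SELECT * FROM digitaltwins
SELECT * FROM digitaltwins T WHERE T.name = 'Node1'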
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Do user-centric companies dream of data stacks? You betcha.
For business users, this dream looks like fresh, actionable data available in frontline tools like Salesforce, Braze, and Marketo whenever they need it.
For data teams, this dream looks like a world where their data--enriched with tools like Fivetran, Snowplow, and dbt--is used to its full potential to fuel operational excellence for every business team that consumes it.
The reality of data stacks past hasn't quite lived up to these dreams, but they've gotten pretty close. Data teams leverage ETL (Extract, Transform, Load) tools like Fivetran to load customer data from mobile and web apps into a central data warehouse. This lets them do deep analysis, build advanced predictive models, and feed BI tools and dashboards to help business teams make decisions. But there's still a gap--what we call "the last mile"--between the warehouse and frontline tools.
The bridge that lets you cross the gap over this last mile of your modern data stack--and from your current data reality to your data dreams--is reverse ETL. Reverse ETL is the difference between making decisions based on your data and finally being able to take action to realize your data dreams.
Let’s dive in.
What is reverse ETL?
Reverse ETL is the process of syncing data from a source of truth like a data warehouse to a system of actions like CRM, advertising platform, or other SaaS app to operationalize data.
That’s basically just a fancy way of saying reverse ETL lets you move data about your users from your warehouse and makes it available for frontline business teams to use in their favorite tools.
However, to really understand the power of reverse ETL (and why it’s not just another data pipeline), we first need to take a quick look at what traditional ETL pipelines made possible for business and data teams.
What’s in a name: ETL vs reverse ETL
The traditional extract, transform, load (ETL) data pipeline has remained largely unchanged since the 1970s: extract the data from the source, convert it to a usable format (the transformation step), then load it into your data warehouse.
The advent of flexible data pipeline tools like Fivetran has also made it possible to load your data into the warehouse and then use your storage target to transform it (referred to as ELT). These ETL/ELT pipelines enabled companies to combine data from multiple sources into a single source of truth to inform business intelligence decisions.
This version of the modern data stack worked well when data sources were more limited (i.e. there was less data volume) and the data engineers who supported these stacks had ample bandwidth to process and answer questions about data. As you’ve probably experienced, that’s no longer the case and teams need more sophisticated tools to achieve the dream of operational analytics.
This reverse journey à la reverse ETL makes operational analytics possible. Reverse ETL tools flip the Fivetran role, extracting data from the warehouse, transforming it so it plays nice with the target destination’s API (whether Salesforce, HubSpot, Marketo, Zendesk, or others), and loading it into the desired target app.
Modern data stack 2.0: The era of operational analytics
The reverse ETL-inclusive modern data stack is the modern data stack 2.0. The growth in popularity of this new generation of data stack is emblematic of an important trend: Companies need to move data capabilities out of centralized silos and embed them within teams across business functions.
Reverse ETL equips these teams with detailed data inside the tools they're already using like Salesforce or Hubspot, empowering them to be more effective in their day-to-day work. The reverse ETL process effectively aligns your organization and applications around your source of truth. From there, business teams can build a shared, deep understanding of customers like never before.
The continuous flow of data--from raw data being pulled into apps to data being modeled to data being deployed into each app--creates a virtuous loop of operational analytics. And it’s only possible with reverse ETL.
The modern data stack 2.0 generally consists of the following tools performing four key functions to close the operational analytics loop:
Data integration: Also referred to as collection, this is an ETL tool like Fivetran or Snowplow that integrates your data sources into your warehouse.
Data storage: A data warehouse that can store structured and unstructured data in one place like Google BigQuery, Snowflake, or Amazon Redshift.
Data modeling: A modeling tool like dbt comes pre-configured with a massive library of data models to make your data usable in different situations.
Data operationalization: A reverse ETL tool like Census will pull data out of your warehouse, validate it, and load it into applications that need it like Salesforce or Zendesk.
As more teams within an organization require data to drive their daily operations, reverse ETL will become necessary to support democratizing data at scale.
Why you need reverse ETL
Without a reverse ETL tool, your data, and the insights from it, are locked within your BI tools and dashboards. This won’t fly in the era of product-led growth, which pushes companies across the B2B and B2C spectrum to improve customer experiences with personal, data-informed strategies.
As we touched on above, the key to this personal customer connection lies in operationalizing our data. Before reverse ETL, data pipelines were built for analytics alone (which meant data efforts were primarily focused on understanding past behavior). Now, companies can architect their data stacks to fuel future action, as well as understand past events (aka operational analytics).
At its core, operational analytics is about putting an organization’s data to work so everyone can make smart decisions about your business. - Boris Jabes, Census CEO
Reverse ETL lives at the heart of operational analytics at scale, constantly pumping real-time customer data into third-party applications to ensure when it comes time to make a decision, the right person has the right data to do it.
When teams across an organization work with synced data, traditionally difficult to automate tasks become much more straightforward. For example, reverse ETL makes it possible to intervene in the customer journey at just the right time by connecting your CRM and email platform to your data warehouse. This means more successful outreach campaigns and more delighted customers.
Reverse ETL use cases
Connecting teams throughout your organization to the warehouse using reverse ETL empowers them with data enriched with valuable context about what your customers are doing in real-time. As we discussed, an operational analytics approach puts data into the hands of people to inform their day-to-day operations. Let’s look at some leading use cases of how customer success, sales, marketing, and data teams benefit from reverse ETL.
CS success with better, faster data and reverse ETL
Customer success teams are responsible for more business outcomes than ever before, from traditional support efforts to product adoption to retention efforts to expansion initiatives. To meaningfully contribute to goals in each lane of their job descriptions, customer success teams need high-quality, trustworthy data when they need it, in the tools they rely on.
Industry-leading companies like Loom, Atrium, and Bold Penguin have upgraded their modern data stacks with reverse ETL to accomplish some awe-inspiring milestones, including but not limited to:
Helping customer success and sales better collaborate to reiterate product value at just the right time to customers.
Making account type and hierarchical ticket prioritization a reality.
Supporting self-serve data capabilities for the customer success team, decreasing their reliance on the data team.
Reducing response times for common support issues from days to minutes.
As we've said before, reverse ETL isn't about adding another tool to your stack, it's about empowering people with better data so they're unblocked to do their best work. With reverse ETL, customer success and ops teams can quickly and easily tap into powerful data insights to better serve customers and contribute to growth goals.
Need some CS inspiration? Check out our customer service use-cases in our Good to great series highlighting the best and brightest in CS.
Sales team heroics: Up-leveling sales with reverse ETL
In the era of product-led growth, it's no longer enough to just have a great product, you need to foster a good relationship with every lead from the start.
Often, reverse ETL is the difference between a missed connection and a life-long customer bond.
Here are some of the people-first use cases cutting-edge sales teams at companies like Figma, LogDNA, and Snowplow Analytics have unlocked using reverse ETL:
Improved understanding of what features customers loved most and where each customer was in their life cycle.
Unified understanding of customers and the organizations they belonged to with identity resolution.
Improved AE and AM focus and effectiveness with lead scoring.
Real-time sales forecasting in Google Sheets.
Gave the sales team access to high-quality behavioral data in the tools they loved to help them meaningfully connect with prospects (without engineering favors).
Need some sales inspiration? Check out sales use-cases in our Good to great series highlighting the best and brightest in sales.
Building better, faster, stronger marketing teams with reverse ETL
With customers' expectations climbing higher every year, it's more important than ever that marketing teams have access to complete, fresh data to attract and convert new customers (and delight current users).
Industry-leading marketing teams--like the ones found at Notion and Canva--have cracked the code on data-driven marketing operations with reverse ETL. Here are a few examples of what you too could do with reverse ETL:
Eliminate the need for custom integration requests and manual email address uploads.
Quickly and easily get data into Salesforce workflows for lead scoring and PQLs.
Get the full functionality out of all your existing tools (and unlock tools on your marketing team's wishlist).
Leverage more actionable user data to drive segmentation and personalization.
Fuel faster experimentation with ad targeting and user propensity scoring.
With reverse ETL marketing teams can build hyper-personalized marketing campaigns by merging product, support, and sales data to power customer segmentation. No more missed opportunities.
Need some marketing inspiration? Check out marketing use-cases in our Good to great series highlighting the best and brightest in marketing.
Reverse ETL helps data teams step into their power
No one got promoted for building ETL/reverse ETL. When data teams spend their time building and maintaining bespoke integration solutions, they're blocked from doing the innovative, high-impact data work they were hired for.
With reverse ETL, data teams at companies like Canva, Clearbit, and Loom have been able to not just better meet the ad hoc needs of business teams, but carve out time to change the role and culture of data entirely at their organizations. This kind of visionary data work is what nearly every industry in the game needs to embrace to move into the future.
Here are some examples of what reverse ETL can do for data teams:
Reduce the time data teams spend doing tedious integration build work and more time doing exciting, engaging data work.
Increase the ability of data leaders to advocate for the skill sets of the data team and establish the data team as a key stakeholder.
Foster happier internal customers of data, which means more people take strategic action from data.
Generate fresher, more accurate data for outreach campaigns.
Give the data team complete control of the data flow from ETL to the frontline tools.
Need some data team inspiration? Check out data-team success stories in our Good to great series highlighting the best and brightest in data.
Reverse ETL vs point-to-point integrations
The no-code, plug-and-play nature of a point-to-point platform like Workato, Zapier, or Mulesoft often entices teams without dedicated technical or data resources to set up any necessary integrations. But relying too heavily on these quick fixes can quickly get messy as your data stack grows.
Fully integrating point-to-point solutions with your data stack requires rapidly more connections as your stack grows: the number of connections grows roughly with the square of the number of applications. With eight apps, every directed pair is a potential integration, which is 8 × 7 = 56 distinct connections to keep your entire stack in sync.
Things can get messy quickly when you’re trying to manage too many integrations.
But with all of the customer data you already have sitting in your warehouse, there’s a better way. Instead of a messy, spaghetti pile of point-to-point integrations, you can use reverse ETL to architect your data infrastructure as a series of orderly spokes around a central hub (data warehouse). This creates a single source of truth informing each application and workflow within your stack to make you truly data-informed.
What to look for in a reverse ETL tool
As is the case with most software, when looking for a reverse ETL tool you’ll have to decide whether to buy an established product or attempt to build a bespoke solution with your resources on hand.
Building a custom reverse ETL pipeline may seem attractive, but it comes with the added complexity of not only engineering each individual connector but maintaining them against ever-changing destination APIs.
If you want to save your business teams from endless ticket filing (and save your engineers from having to address all those tickets), it’s time to consider a managed reverse ETL solution from an expert vendor. Here is a high-level overview of the seven key features to look for in a potential reverse ETL tool:
Connector quality: A reverse ETL tool is only as useful as the applications it connects to. Look for the connections you need today and the specific features of each.
Sync robustness: Syncing is arguably the most important feature and should be fast, be reliable, sync only data that’s changed, and be automatable.
Observability: Your reverse ETL should offer alerting, integrations with monitoring tools, detailed logs, and the ability to rollback syncs, if necessary.
Security and regulatory compliance: Vendors should have security credentials like SOC 2 Type I or II, encrypt data in transit and at rest, and use best-in-class security for APIs.
SQL fluency and ease of use: To be as user-friendly as possible, your reverse ETL tool should be SQL friendly, allow for easy modeling, and have an intuitive user interface.
Community and vendor support: Make sure your reverse ETL vendor has a high commitment to SLAs, readily available support and in-app support, and good documentation.
Transparent pricing: When buying a reverse ETL tool, make sure you know if the vendor charges by consumption, number of connectors, or fields per sync.
If you do your due diligence when selecting reverse ETL vendors, you’ll have the ultimate tool in your toolbox to ensure you get the most out of your data today and as you scale in the future.
Reverse ETL makes your data (and the teams that use it) more efficient
When front-line teams can self-serve highly detailed customer data, translated, validated, and formatted for their favorite tools, data teams can spend less time crunching numbers and running reports and more time using their insights to inform business strategy.
The traditional role of data or analytics teams was, first and foremost, to report on how a product or campaign performs over time and serve the requests of the business teams they support.
This type of reporting and support was useful for monitoring the long-term health of your user base or high-level budget planning, but it couldn’t power automation or help customer success managers triage incoming support requests.
Today, data teams have embraced a whole new set of sophisticated analytics engineering skills. Unblock them and let them use these skills (you’ll be amazed at what they can do, we promise).
With reverse ETL in place, modern data teams turn data warehouses into the central nervous system of an organization, fueling email marketing, customer support tools, sales tools, or even financial models. This means more successful business teams that can self-serve deep, useful data and more efficient DataOps overall.
“With Rivery, you can rely on a system that is easy-to-use, easy to scale up, and at the end of the day, you’re not spending most of your time cleaning up data or worrying about your ETL pipelines going down.”
In today’s data-first economy, ETL is not just a process for centralizing data in a data warehouse. Increasingly, teams are using ETL to send data from a data warehouse into third party systems. “Reverse ETL,” as this method is known, is quickly becoming a core part of tech data stacks. Here’s everything your data team needs to know about reverse ETL.
ETL vs. Reverse ETL – What’s the Difference?
To understand reverse ETL, consider this quick refresher on traditional ETL. Extract, transform, and load (ETL) is a data integration methodology that extracts raw data from sources, transforms the data on a secondary processing server, and then loads the data into a target database.
More recently, with the rise of cloud data warehouses, extract, load, transform (ELT) is beginning to supplant ETL. Unlike the ETL method, ELT does not require data transformation before the loading process. ELT loads raw data directly into a cloud data warehouse. Data transformations are executed inside the data warehouse via SQL pushdowns, Python scripts, and other code.
ETL and ELT both transfer data from third party systems, such as business applications (Hubspot, Salesforce) and databases (Oracle, MySQL), into target data warehouses. But with reverse ETL, the data warehouse is the source, rather than the target. The target is a third party system. In reverse ETL, data is extracted from the data warehouse, transformed inside the warehouse to meet the data formatting requirements of the third party system, and then loaded into the third party system for action taking.
The method is referred to as reverse ETL, rather than reverse ELT, because data warehouses cannot load data directly into a third party system. The data must first be transformed to meet the formatting requirements of the third party system. However, this process is not traditional ETL, because data transformation is performed inside the data warehouse. There is no “in-between” processing server that transforms the data.
Here’s an example: If a Tableau report contains a customer lifetime value (LTV) score, this Tableau-formatted data is not processable in Salesforce. So a data engineer applies an SQL-based transformation to this report data within Snowflake to isolate the LTV score, format it for Salesforce, and push it into a Salesforce field so sales representatives can use the information.
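As a rough sketch of what such an in-warehouse transformation could look like (the table and column names below are hypothetical, not taken from the example above):
-- Hypothetical Snowflake query: isolate each customer's lifetime value
-- and shape it into one row per Salesforce contact for the sync
select
    c.salesforce_contact_id,
    round(ltv.lifetime_value, 2) as ltv_score
from analytics.customer_ltv ltv
join analytics.customers c
    on c.customer_id = ltv.customer_id
where c.salesforce_contact_id is not null;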
What’s the Impact of Reverse ETL?
By pushing data back into third party systems such as business applications, reverse ETL operationalizes data throughout an organization. Reverse ETL enables any team, from sales, to marketing, to product, to access the data they need, within the systems they use. The applications of reverse ETL are numerous, but some examples include:
Syncing internal support channels with Zendesk to prioritize customer service
Pushing customer data to Salesforce to enhance the sales process
Adding product metrics to Pendo to improve the customer experience
Combining support, sales, and product data in Hubspot to personalize marketing campaigns for customers.
Even in companies with a cloud data warehouse, data does not always end up in the right hands. Reverse ETL solves this problem by pushing data directly into the applications leveraged by line-of-business (LOB) users. In some companies, teams already have access to the data they need via BI reports. But, to the dismay of BI developers, these reports are often underutilized.
What teams really want is to access data within the systems and processes that they are familiar with. This is exactly what reverse ETL enables. With reverse ETL, business users can actually harness data in an operational capacity. Teams can act on the data in real-time, and use it to make key decisions.
Reverse ETL can also streamline data automation throughout a company. Reverse ETL helps eliminate manual data processes, such as CSV pulls and imports, involved in data tasks. Sometimes, reverse ETL is another way to complete a step in a broader data workflow. For instance, if you’re building an AI/ML workflow on top of your Databricks stack, reverse ETL can push formatted data into the sequence. Companies are also increasingly incorporating reverse ETL into in-app processes, such as syncing production databases.
Reverse ETL: What’s the Right Solution?
A team can either build or buy an ETL solution. Those that build an ETL solution must create data connectors — the connection between the data source and the data warehouse — from scratch. This building process can take weeks or months, depending on the development team, and hampers scalability, drains dev resources, and saddles the system with long-term maintenance requirements. For these reasons and others, data teams often consider SaaS ETL platforms.
SaaS ETL platforms come with “pre-built” data connectors. The number of connectors vary by provider, but platforms often offer “plug-and-play” ETL connectors for the most popular data sources. However, these data connectors extract data from third party systems and load them into data warehouses. Reverse ETL data connectors are the opposite. They must extract data from data warehouses and load them into third party systems.
In other words: an ETL data connector is not a reverse ETL data connector. So when you see ETL platforms advertising ETL data connectors, that does not necessarily mean that a reverse ETL data connector is also available. In fact, it’s not uncommon for teams to adopt an ETL platform and still have to build reverse ETL connectors on the backend.
Building a reverse ETL connector can be just as intensive as building a traditional ETL connector. So, if reverse ETL is important to you, research the feature for each platform before you buy. At Rivery, we offer reverse ETL for all of our data sources, including via Action Rivers. Send data from your cloud data warehouse to business applications, marketing clouds, CDPs, REST APIs, and more.
Simple Solutions for Complex Data Pipelines and Workflows
Rivery's SaaS platform provides a unified solution for ELT pipelines, workflow orchestration, and data operations. Easily solve your most complex data challenges and enable your company to scale at hyperspeed. With data needs growing, Rivery allows you to achieve more with less and create the most efficient, scalable data stack.
Completely Automated SaaS Platform: Get setup and start connecting data in the Rivery platform in just a few minutes with little to no maintenance required.
Unified Data Ingestion, Transformation, & Orchestration: 100% data source capability, insight-ready data with both SQL and Python transformations, and complete workflow automation.
190+ Native Connectors: Instantly connect to applications, databases, file storage options, and data warehouses with our fully-managed and always up-to-date connectors, including BigQuery, Redshift, Shopify, Snowflake, Amazon S3, Firebolt, Databricks, Salesforce, MySQL, PostgreSQL, and Rest API to name just a few.
Python Support: Have a data source that requires custom code? With Rivery’s native Python support, you can pull data from any system, no matter how complex the need.
Change Data Capture/Data Replication: Rivery’s best-in-class CDC support provides an easy, reliable and fast solution for replicating data from a database to your data warehouse.
1-Click Data Apps: With Rivery Kits, deploy complete, production-level workflow templates in minutes with data models, pipelines, transformations, table schemas, and orchestration logic already defined for you based on best practices.
Data Development Lifecycle Support: Separate walled-off environments for each stage of your development, from dev and staging to production, making it easier to move fast without breaking things. Get version control, API, & CLI included.
Data Operations: With Rivery, you get centralized logging & reporting, monitoring & alerts, and data quality as part of a robust data operations layer for your data pipelines.
Solution-Led Support: Consistently rated the best support by G2, receive engineering-led assistance from Rivery to facilitate all your data needs.
Data Security: Industry-leading security and enterprise-grade privacy standards are built into Rivery’s network, product, and policies.
Send Data from Snowflake to Hubspot to Track Free Trial Account Usage
Here’s a real life reverse ETL example we’ve implemented at our very own company. Rivery uses reverse ETL to update custom Hubspot deal properties when certain events transpire in a DWH.
This is how we do it.
To perform Reverse ETL in Rivery, you will need:
Target data warehouse
Data destination (REST API endpoint)
A general knowledge of the destination API structure and behavior is recommended. Rivery’s Action Rivers will pass data from a DWH table through the corresponding inputs of a REST API endpoint.
Pain Point: What Problem is Reverse ETL Solving?
First, our pain point. At Rivery, we want our salespeople to track free trial account usage, so they can better respond to prospects in the sales cycle. This process revolves around three product usage properties in particular:
Number of rivers
Number of runs
Number of distinct sources
In order to power the workflow, we need to send these three custom properties about product usage to Hubspot, our CRM platform of choice, via reverse ETL.
Let’s assume we already have this data loaded in a table in Snowflake called RIVER_EXECUTIONS.
1. Define Data to Push to Hubspot
To start, we must create a logic step that returns only the data we need — number of rivers, runs, and distinct sources — to Hubspot. We want to identify these data points, but only for active trial accounts in our system.
When a new trial account is created, a Hubspot deal is auto-generated. The Hubspot deal includes the Rivery account ID as a custom property. We will use this account ID to connect Rivery usage data with our sales pipeline.
To define the data we need, we will apply a SQL query to the dataset:
select deals.deal_id,
       runs.rivery_account_id,
       runs.rivery_num_of_rivers,
       runs.rivery_executions,
       runs.rivery_data_sources
from "INTERNAL"."ODS"."HUBSPOT_DEALS" deals
inner join (
    select rivery_account_id,
           count(distinct river_id) as rivery_num_of_rivers,
           count(distinct run_id) as rivery_executions,
           count(distinct datasource_type) as rivery_data_sources
    from INTERNAL.ODS.RIVER_EXECUTIONS
    where account_type = 'trial'
    group by rivery_account_id
) runs
on deals.rivery_account_id = runs.rivery_account_id
where deals.isdeleted = FALSE;
With this SQL query, we are:
Creating three unique metrics grouped at the account level for trial accounts
Joining these metrics to our existing deal pipeline.
Now, create a new Logic River in Rivery and use the ‘SQL/Script’ type of Logic Step. Choose your target type, your connection, and input your source query (i.e. what we have above).
Define a table in your target data warehouse to store these results. In our example, we’ll call this table ‘TRIAL_DEALS_PROPERTIES’.
2. Transform Data to Meet Hubspot CRM API Requirements
In the next step, we will utilize the Hubspot CRM API to update Hubspot deals with the data from Step 1. Each call to the Hubspot CRM API requires a dealid parameter, and a request body populated with a properties object to update the deals.
By leveraging Snowflake's JSON-building functions, such as OBJECT_CONSTRUCT(), we can produce this object for each deal_id. This enables us to perform each update call by passing both a deal_id and the corresponding properties into a Rivery Action River.
This example query uses the OBJECT_CONSTRUCT() function to produce the desired results:
select deal_id,
       object_construct(
           'rivery_num_of_rivers', rivery_num_of_rivers,
           'rivery_executions', rivery_executions,
           'rivery_data_sources', rivery_data_sources
       ) as property_values
from "INTERNAL"."AUTOMATION"."TRIAL_DEALS_PROPERTIES"
group by deal_id, rivery_num_of_rivers, rivery_executions, rivery_data_sources;
Sample Results
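The result screenshot from the original post is not reproduced here. With hypothetical values, a result row would look roughly like this, with the keys coming from the OBJECT_CONSTRUCT() call above:
deal_id: 1234567890
property_values: { "rivery_num_of_rivers": 12, "rivery_executions": 250, "rivery_data_sources": 4 }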
Now we’ll add another Logic Step and use the code above as the source query. However, instead of setting a table as the target value, we’ll store our results in a variable, one of Rivery’s Logic functions.
This will store the results of the query in a variable, which we can leverage in future Logic Steps.
3. Build Custom Connection for Hubspot CRM API
Using the Hubspot CRM API documentation, we can define the corresponding REST template within an Action River. We will leverage the Action River in our existing Logic River.
In the request body, we’ll define a variable called properties. Once we add the Action River as a step in the Logic River, this body will contain the properties that will be updated for each call.
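For reference, the call that such a template ultimately maps to looks roughly like the following against the current Hubspot CRM v3 API (the original article may have targeted a different API version, and the deal id, token, and property values here are placeholders):
curl -X PATCH "https://api.hubapi.com/crm/v3/objects/deals/{deal_id}" \
  -H "Authorization: Bearer {access_token}" \
  -H "Content-Type: application/json" \
  -d '{"properties": {"rivery_num_of_rivers": "12", "rivery_executions": "250", "rivery_data_sources": "4"}}'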
4. Finalize Logic Steps
Lastly, we need to add the Action River from Step 3 to the Logic River from Steps 1 & 2. This will unify the Reverse ETL process in one Logic River. In the Logic River, we'll add a third step and select an 'Action' step. Choose the Action River from Step 3.
Next, click 'Container Me' on the Action River to wrap this step in a container. Change the container type to 'Loop Over' in the dropdown. We need to send one request per deal_id, so to handle multiple deals we must loop over the Action Step.
The step will look something like this:
In the 'for each value in' dropdown, select the variable created by the second Logic Step (called 'deal_properties' here). In the Variables menu of the Logic River, make sure this variable is set to accept multiple values:
In the second window, create two iterators, one for the deal_id parameter and one for the properties.
In addition, set the Input Variables in the Action Step to match the iterators created above. This is the key that will connect the data values stored in the variable to the variables defined in the API request.
5. Drive Sales Process with Actionable, Relevant Data
And voila – our salespeople have access to the free trial information they need to optimize the sales process. With reverse ETL, we sent product data from our data warehouse into Hubspot, so our sales team can associate free trials with product usage and engage in more targeted sales practices.
Reverse ETL: Put Data to Work, Make Results Happen
Data warehouses were introduced to eliminate data silos. But for too many companies, data warehouses have become data silos. Reverse ETL solves this conundrum by removing the barrier between a data warehouse and the rest of the company. Now teams can operationalize data, in the systems and processes they feel comfortable with, and act on data to drive results. In today’s breakneck data economy, reverse ETL puts data in the hands and workflows that need it, when they need it.
The DataStax Bulk Loader, dsbulk, is a new bulk loading utility introduced in DSE 6 (To download the DataStax Bulk Loader click here). It solves the task of efficiently loading data into DataStax Enterprise, as well as efficiently unloading data from DSE and counting the data in DSE, all without having to write any custom code or using other components, such as Apache Spark. In addition to the bulk load and bulk unload use cases, dsbulk aids in migrating data to a new schema and migrating data from other DSE systems or from other data systems. There is a good high-level blog post that discusses the benefits of dsbulk:
Easy to use.
Able to support common incoming data formats.
Able to export data to common outgoing data formats.
Able to support multiple field formats, such as dates and times.
Able to support all the DSE data types, including user-defined types.
Able to support advanced security configurations.
Able to gracefully handle badly parsed data and database insertion errors.
Able to report on the status and completion of loading tasks, including summary statistics (such as the load rate).
Efficient and fast.
Now, I’m a person who learns by example, so what I’m going to do in this series of blog posts is show some of the ways to use dsbulk to do some common tasks. For the documentation on dsbulk, including all of the parameters and options, see the documentation pages for dsbulk.
$ cqlsh -e "SELECT COUNT(*) FROM dsbulkblog.iris_with_id;"
count ------- 150
(1 rows)
Warnings : Aggregation query used without partition key
Setup
For these examples, we are going to use a few tables, so let’s start by creating them:
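(The CQL statements themselves are not included in this excerpt. As a rough sketch, with column types inferred from the examples that follow and a replication setting chosen only for illustration, the main table used below could be created like this; the real definitions may differ.)
$ cqlsh -e "CREATE KEYSPACE IF NOT EXISTS dsbulkblog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
$ cqlsh -e "CREATE TABLE IF NOT EXISTS dsbulkblog.iris_with_id (id INT PRIMARY KEY, sepal_length DOUBLE, sepal_width DOUBLE, petal_length DOUBLE, petal_width DOUBLE, species TEXT);"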
The data files are hosted in a public gist here. You will want to download the individual files to your local machine. For simplicity, let’s say that you download them to /tmp/dsbulkblog. Then you should have these files:
Before we look at the output and the results, let’s break apart this command-line:
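(The command itself is not shown in this excerpt; based on the parameter walkthrough below, it would have been along these lines.)
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id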
load: After the dsbulk command, we have load. There are currently (as of dsbulk 1.2) three modes for dsbulk: load, unload, and count. As you might have guessed, load is used for loading data from files into DSE, and unload is used for exporting data in DSE to files. The count mode will count the data in DSE and report various metrics.
-url: Next up is the -url parameter. This is the file name or location of the file (or resource, such as an HTTP URL) to be loaded. It can be a single file, a directory of files, a URL (such as https://gist.github.com/brianmhess/8864cf0cb0ce9ea1fd64e579e9f41100/raw/...), or stdin (which is the default). Here, we are pointing to the /tmp/dsbulkblog/iris.csv file that we downloaded.
-k: Next is the -k parameter, which indicates the keyspace to use.
-t: Then comes the -t parameter, which indicates the table to use.
We set -k to dsbulkblog and -t to iris_with_id, which means we will be loading into the dsbulkblog.iris_with_id table.
Operation LOAD_20190314-161940-640593 completed successfully in 0 seconds. Last processed positions can be found in positions.txt
We can see that dsbulk has written some information to stderr. Let me call out the table of final stats, which indicate that a total of 150 records, all of which were successful, were written. The column for failed is the number of rows that did not successfully load. The remaining statistics are throughput and latency metrics for the load. We can verify that the database has 150 records with a simple COUNT query (Note: We are doing this count operation via cqlsh for illustrative purposes here. We will discuss dsbulk’s count operation later.):
So, there was a lot of other information in those messages, including information about the log directory. I ran this command in /tmp, and by default dsbulk will create a subdirectory named logs in the local directory in which the command was run, and then create a subdirectory of that logs directory for each run of dsbulk. The load runs will have a subdirectory that begins LOAD_, while the unload runs will have a subdirectory that begins with UNLOAD_, and the count runs will have a subdirectory that begins with COUNT_. From the output we see:
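(The directory line is not reproduced in this excerpt; given the operation name above, it would have been something like the following.)
Operation directory: /tmp/logs/LOAD_20190314-161940-640593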
A keen eye will notice that the filename includes the year-month-day-hour-minute-second at the time when dsbulk was run. This log directory includes two files: operation.log and positions.txt. The operation.log file is the main log file and contains the settings used, as well as the main log of operations. The positions.txt file is used to pick up a partial load from where the last run left off. If there were other errors (we'll see some later), we would see them in other files in this directory.
Example 2: Loading from stdin
dsbulk was designed to work like other command-line tools and load data from stdin and unload data to stdout. In fact, the default -url is stdin for loading and stdout for unloading.
We could also add a header to data that has no header. For example, the iris_no_header.csv file has no header line, but otherwise is identical to the iris.csv file. We could use awk to prepend a header line, such as:
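(A sketch; the exact command is not included in this excerpt, and the header order below is inferred from the data rows and mappings shown later in this post.)
$ awk 'BEGIN{print "sepal_length,sepal_width,petal_length,petal_width,species,id"} {print}' /tmp/dsbulkblog/iris_no_header.csv | dsbulk load -k dsbulkblog -t iris_with_id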
We could also add missing columns using command-line tools. For example, the iris_without_id.csv file is just like iris.csv, but without an id column. Since id is part of the primary key, we need to create that column. We could do that again with awk:
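(A sketch, assuming iris_without_id.csv keeps its header line; it appends an "id" field to the header and a sequential id to every data record.)
$ awk -F, 'NR==1 {print $0 ",id"} NR>1 {print $0 "," (NR-2)}' /tmp/dsbulkblog/iris_without_id.csv | dsbulk load -k dsbulkblog -t iris_with_id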
Okay, it’s clear that I like awk - and sed, and cut, and … - but you can use whatever command-line tools you want: perl, python, etc. Or you can write your own application using whatever language you want - Java, C, etc - that writes to stdout.
Example 2.2: Loading from a URL
dsbulk will also load from a URL, such as the HTTP address for the iris.csv example:
Example 2.4: Loading only some files in a directory
We can load only some of the files in the directory, too. Let’s put the president_birthdates.psv file in the /tmp/dsbulkblog/iris directory, and then tell dsbulk to only load the .csv files:
We will tell dsbulk to only load the *.csv files by specifying --connector.csv.fileNamePattern parameter (**/*.csv is the default for CSV files, but we do it for illustration; see here for more information on the filename patterns.):
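(A sketch; the directory is the one from the previous example, and the pattern value is the default mentioned above.)
$ dsbulk load -url /tmp/dsbulkblog/iris -k dsbulkblog -t iris_with_id --connector.csv.fileNamePattern "**/*.csv"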
Now, how did dsbulk know which input fields mapped to which columns of the dsbulkblog.iris_with_id table? Well, it looked at the first line, or the header line, of the input. The first line of the /tmp/dsbulkblog/iris.csv file is:
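(The header line is not reproduced in this excerpt; judging from the data rows and mappings shown later in this post, it is presumably the following.)
sepal_length,sepal_width,petal_length,petal_width,species,id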
It just so happens that these column names match exactly the column names of dsbulkblog.iris_with_id. dsbulk uses this header line by default to learn how to match up the input fields to the column names. We got lucky that they match.
So, what would we do if the column names did not match? Well, we could explicitly list the mapping from the input fields to the column names. There are a few ways to do this. The first would be to list the mapping from what the file has to what the table has. For example:
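(A sketch of that first form, with the field names matching the column names exactly.)
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "sepal_length=sepal_length,sepal_width=sepal_width,petal_length=petal_length,petal_width=petal_width,species=species,id=id"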
Now, this example is a little silly because we have the same field names in the header as the column names. To borrow a bit from the example above where we provided the header, let’s provide a different header and then use the mapping:
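(A sketch; only the sepal_l field name is referenced in the text below, so the other renamed fields here are made up for illustration.)
$ awk 'NR==1{print "sepal_l,sepal_w,petal_l,petal_w,species,id"} NR>1{print}' /tmp/dsbulkblog/iris.csv | dsbulk load -k dsbulkblog -t iris_with_id -m "sepal_l=sepal_length,sepal_w=sepal_width,petal_l=petal_length,petal_w=petal_width,species=species,id=id"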
As you can see, the -m takes a list of mappings from input field names, as specified in the header, to column names, as specified in the database schema. So, the input field named sepal_l gets loaded to the column named sepal_length.
There is one handy shortcut that you can send in to the -m, namely *=*. This says that all names that match should just be matched (and is the default mapping if none is provided). So, these two are the same:
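(For example, these two invocations should be equivalent.)
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "*=*"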
We could also specify that we should load all the columns that match, but exclude a column. For example, if we did not want to load the species we could do:
Another way to specify mappings is by index, instead of name. So, consider the iris_no_header.csv file, which does not have a header line. We could do this by mapping the index of the input to the columns:
This will map the columns in order. That is, this is saying that the first field maps to the sepal_length column, the second maps to the sepal_width column, etc. If the input file has a header with field names field1, field2, etc, then this is equivalent to:
Notice that we don’t actually set -header to false. If we do that, then we need to skip the first line, otherwise we will try to load that line as data. We can do that with the -skipRecords parameter:
While we are on the subject of skipping records, we should also talk about just loading some records. We can limit the load to the first 20 records using the -maxRecords parameter:
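(A sketch using the -maxRecords parameter named above.)
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -maxRecords 20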
Now, what if we did something wrong and forgot a column in the mapping? Let's say we forgot the id column:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -header false -m "sepal_length,sepal_width,petal_length,petal_width,species"
Operation directory: /tmp/logs/LOAD_20190314-162512-798132
Operation LOAD_20190314-162512-798132 failed: Missing required primary key column id from schema.mapping or schema.query.
Last processed positions can be found in positions.txt
We see here that we get an error because id is a primary key column, and dsbulk needs to have all primary key columns specified.
However, if we include the columns but get the wrong order, such as swapping the id and species columns, then we will get a different error:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id -m "0=sepal_length,1=sepal_width,2=petal_length,3=petal_width,4=id,5=species"
Operation directory: /tmp/logs/LOAD_20190314-162539-255317
Operation LOAD_20190314-162539-255317 aborted: Too many errors, the maximum allowed is 100.
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
  102 |    101 |      0 | 0.00 |   0.00 |  0.00 |  0.00 |   0.00 |    0.00
Rejected records can be found in the following file(s): mapping.bad
Errors are detailed in the following file(s): mapping-errors.log
Last processed positions can be found in positions.txt
Here we see that we got more than 100 errors. But what kind of errors? To figure that out, we need to check out the /tmp/logs/LOAD_20190314-162539-255317 files, specifically, the mapping-errors.log file. The beginning of that file looks like:
Resource: file:/tmp/dsbulkblog/iris.csv
Position: 129
Source: 6.4,2.8,5.6,2.1,Iris-virginica,128\u000a
java.lang.IllegalArgumentException: Could not parse 'Iris-virginica'; accepted formats are: a valid number (e.g. '1234.56'), a valid Java numeric format (e.g. '-123.45e6'), a valid date-time pattern (e.g. '2019-03-14T16:25:41.267Z'), or a valid boolean word
  at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:119)
  at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)
  at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:55)
  at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:26)
  at com.datastax.dsbulk.engine.internal.codecs.ConvertingCodec.serialize(ConvertingCodec.java:50)
  Suppressed: java.time.format.DateTimeParseException: Text 'Iris-virginica' could not be parsed at index 0
    at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
    at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1819)
    at com.datastax.dsbulk.engine.internal.codecs.util.SimpleTemporalFormat.parse(SimpleTemporalFormat.java:41)
    at com.datastax.dsbulk.engine.internal.codecs.util.ZonedTemporalFormat.parse(ZonedTemporalFormat.java:45)
    at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:106)
  Suppressed: java.lang.NumberFormatException: For input string: "Iris-virginica"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.lang.Double.valueOf(Double.java:502)
    at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:101)
  Suppressed: java.lang.NumberFormatException: null
    at java.math.BigDecimal.<init>(BigDecimal.java:494)
    at java.math.BigDecimal.<init>(BigDecimal.java:383)
    at java.math.BigDecimal.<init>(BigDecimal.java:806)
    at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:96)
    at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)
  Suppressed: java.text.ParseException: Invalid number format: Iris-virginica
    at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:247)
    at com.datastax.dsbulk.engine.internal.codecs.util.CodecUtils.parseNumber(CodecUtils.java:92)
    at com.datastax.dsbulk.engine.internal.codecs.string.StringToNumberCodec.parseNumber(StringToNumberCodec.java:72)
    at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:55)
    at com.datastax.dsbulk.engine.internal.codecs.string.StringToIntegerCodec.externalToInternal(StringToIntegerCodec.java:26)
We can see that dsbulk is trying to convert Iris-virginica into a number. This is because it believes that this column is the id column, which is an INT. If we look through the file we see the same error over and over again. We've clearly messed up the mapping. After 100 errors (that is the default, but it can be overridden by setting -maxErrors) dsbulk quits trying to load.
Whenever dsbulk encounters an input line that it cannot parse, it will add that line to the mapping.bad file. This is the input line as it was seen by dsbulk on ingest. If a line or two got garbled or had a different format than the other lines, dsbulk will populate the mapping.bad file, log the error in mapping-errors.log, and keep going until it reaches the maximum number of errors. This way, a few bad lines don't mess up the whole load, and the user can address the few bad lines, either by inserting them manually or by running dsbulk on the mapping.bad file with different arguments.
For example, to load these bad lines we could load with:
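(A sketch, assuming we point -url at the mapping.bad file in that run's log directory and supply the corrected index-based mapping; adjust the path to match your own run.)
$ dsbulk load -url /tmp/logs/LOAD_20190314-162539-255317/mapping.bad -k dsbulkblog -t iris_with_id -header false -m "0=sepal_length,1=sepal_width,2=petal_length,3=petal_width,4=species,5=id"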
Operation LOAD_20190314-163033-590156 completed successfully in 0 seconds.
Last processed positions can be found in positions.txt
Example 3.6: Missing fields and extra fields
Sometimes we are only loading some of the columns. When the table has more columns than the input, dsbulk will by default throw an error. Now, while it is necessary that all primary key columns be specified, it is allowable to leave other columns undefined or unset.
For example, let’s say we didn’t have the sepal_length column defined. We could mimic this with awk, such as:
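(A sketch that drops the first column before piping to dsbulk; as described above, this load is expected to fail by default because sepal_length is missing. The exact command is not included in this excerpt.)
$ awk -F, 'BEGIN{OFS=","} {print $2,$3,$4,$5,$6}' /tmp/dsbulkblog/iris.csv | dsbulk load -k dsbulkblog -t iris_with_id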
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
Operation LOAD_20190314-163121-694645 aborted: Too many errors, the maximum allowed is 100.
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
Extra fields in the input, on the other hand, load just fine by default, and the extra column is simply ignored. However, if we wish to be strict about the inputs, we could cause dsbulk to error when there are extra fields.
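In dsbulk this is governed by the schema.allowExtraFields setting (check the dsbulk documentation for your version to confirm the exact name). A sketch that adds a hypothetical extra_col field with awk and then disallows extra fields:
$ awk -F, 'NR==1{print $0 ",extra_col"} NR>1{print $0 ",XYZ"}' /tmp/dsbulkblog/iris.csv | dsbulk load -k dsbulkblog -t iris_with_id --schema.allowExtraFields false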
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
Operation LOAD_20190314-163305-346514 aborted: Too many errors, the maximum allowed is 100.
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches
Rejected records can be found in the following file(s): mapping.bad
Errors are detailed in the following file(s): mapping-errors.log
Last processed positions can be found in positions.txt
Example 3.7: Whitespace
Sometimes there is whitespace at the beginning of a text string. We sometimes want this whitespace and sometimes we do not. This is controlled by the --connector.csv.ignoreLeadingWhitespaces parameter, which defaults to false (whitespaces are retained). For example:
Notice that I needed to move the id column to the end of the list. This is because the order here matters and must match the order in the input data. In our data, the id column is last.
Example 4.3: Providing a query with constant values
We can also specify a mapping to place constant values into a column. For example, let’s set the species to the same value for all entries:
$ dsbulk load -url /tmp/dsbulkblog/iris.csv -query "INSERT INTO dsbulkblog.iris_with_id(id,petal_width,petal_length,sepal_width,sepal_length,species) VALUES (:id, :petal_width, :petal_length, :sepal_width, :sepal_length, 'some kind of iris')"
Example 4.4: Deleting data with a query
It may seem counterintuitive, but we can delete data using the dsbulk load command. Instead of an INSERT statement, we can issue a DELETE statement. For example, let’s delete the rows in our table that correspond to the first 10 lines (11 if you include the header) of the iris.csv file:
$ head -11 /tmp/dsbulkblog/iris.csv | dsbulk load -query "DELETE FROM dsbulkblog.iris_with_id WHERE id=:id"
Operation directory: /tmp/logs/LOAD_20190320-180959-025572
To learn additional elements for data loading, read Part 2 of the Bulk Loader series here.
Jeffrey Chou
05.31.2022
Sync Computing presents a new kind of scheduler capable of automatically optimizing cloud resources for data pipelines to achieve runtime, cost, and reliability goals
Here at Sync, we recently launched our Apache Spark Autotuner product, which helps people optimize their EMR and Databricks clusters on AWS. It turns out there's more on the roadmap for us: we recently published a technical paper on our automatic globally optimized resource allocation (AGORA) scheduler, which extends beyond cluster autotuning toward the more general concept of resource allocation plus scheduling optimization.
In this blog post we explain the high-level concepts and how they can help data engineers run their production systems more reliably, hit deadlines, and lower costs, all with the click of a button. We show that a simulation of our system on Alibaba's cluster trace resulted in a 65% reduction in total runtime.
Introduction
Let’s say you’re a data engineer and you manage several production data pipelines via Airflow. Your goal is to achieve job reliability, hit your service level agreements (SLA), and minimize costs for your company. But due to changing data sizes, skew, source code, cloud infrastructure, resource contention, spot pricing, and spot availability, achieving your goals in real-time is basically impossible. So how can Sync help solve this problem? Let’s start with the basics.
The Problem
Let's say your data pipelines are run via Airflow, and each one of your Airflow directed acyclic graphs (DAGs) is composed of several task nodes. For simplicity, let's assume each node in the DAG is an Apache Spark job. In the DAG image below in Fig. 1, we see four Apache Spark jobs, named "Index Analysis," "Sentiment Analysis," "Airline Delay," and "Movie Rec."
Fig. 1. Simple Airflow DAG with 4 Apache Spark nodes
You have a deadline and you need this DAG to finish within 800 seconds. How can you achieve that?
One clever way you might think of to try to achieve your goals is to optimize each job separately for runtime. You know 800s is a tight deadline, so you decide to choose configurations for each job that would give you the fastest runtime. There’s no way of figuring out what that might be, so you decide to run a set of experiments: run each job with different configurations and choose the one that gives you the fastest runtime.
Being a data engineer, you could even automate this whole process. Once you've figured out the configurations separately, you can use those configurations to run the DAG with the standard Airflow scheduler. However, what happens in this example is that the jobs are launched serially, each one maximizing the cluster, and the DAG does not meet your deadline. In the figure below we see the number of vCPUs available in the cluster on the y-axis and time on the x-axis.
Fig. 2. VCPUs vs. Runtime plot of the Apache Spark jobs run with Airflow, using default schedulers
You tried to optimize each job separately for the fastest runtime, but it’s still not meeting the requirement. What could you do now to accomplish this goal? Just try increasing the cluster size to hopefully speed up the jobs? That could work but how much larger? You could run more experiments, but now the problem bounds are increasing and the growing list of experiments would take forever to finish. Also, changing the resources for an Apache Spark job is notoriously difficult, as you may cause a crash or a memory error if you don’t also change the corresponding Apache Spark configurations.
The Solution
To really see how best you can reallocate resources to hit the 800s SLA, we’ll have to look at the predicted runtime vs. resources curve for each Apache Spark job independently. For the four jobs, the predicted performance plots are shown in Fig. 3 below across 4 different instances on AWS.
We can see that the optimal number of nodes and the resulting runtime vary widely depending on the job and the hardware.
Fig. 3. Predicted runtime vs. resources plots for each of the 4 Apache Spark jobs
With this information, we can use AGORA to solve the scheduling problem, properly allocate resources, and re-run the DAG. Now we see that the 800s SLA is achieved without changing the cluster size, with the Spark configurations handled appropriately and the DAG dependencies obeyed. What AGORA does is interesting: the "purple" job gets massively compressed in terms of resources, with only a small impact on runtime, whereas the "green" job doesn't change much, because compressing it would blow up its runtime. Understanding which jobs can be "squished" and which cannot is critical for this optimization. In some sense, it's a game of "squishy Tetris"!
Fig. 4. Globally optimized schedule, capable of achieving the 800s SLA
The Catch
Well, that looks great, so what is so hard about it? It turns out that this scheduling problem is what is known in the math world as an NP-hard optimization problem: solving it exactly becomes very difficult to do quickly. With just those 4 jobs, we can see from the graphs below that solving for that schedule can take over 1,000 seconds via brute-force methods. Obviously, nobody wants to wait for that.
Fig. 5. Search space and solve time for the scheduling optimization problem of a simple DAG
The other issue is that we need to predict those runtime vs. resources graphs with just a few prior runs. We don't want to re-run jobs hundreds of times just to run them well once. This is where our Autotuner product comes into play. With just one log, we can predict the cost-runtime performance for various hardware and optimized Spark configurations.
At Sync Computing, we’ve solved both issues:
Scheduling modeling & solve time: We mathematically modeled Apache Spark, cloud resources, and cloud economics as an optimization problem we can solve extremely quickly. More details on the math can be found in the technical paper.
Predicting performance on alternative hardware: Our Autotuner for Apache Spark takes in one log and can predict the performance across various machines — simultaneously accounting for hardware options, costs, spot availability, and Apache Spark configurations. See our case study here.
Goal based optimization
In the simple example above, the priority was to hit an SLA deadline by reallocating resources. It turns out we can also set other priorities. One obvious one is cost savings for cloud usage, in which the cost of the instances is prioritized over the total runtime. In this example, we utilized more realistic DAGs, as shown in the image below:
Fig. 6. Comprehensive DAGs more akin to realistic jobs.
For cost-based optimization, what typically happens is that fewer resources (fewer nodes) are used for each job, which usually results in longer runtimes, albeit at lower cost. Alternatively, we can be runtime-optimized, in which case more resources are used, albeit at higher cost. Of course, the exact number of nodes and the resulting runtime are highly dependent on the exact job.
When we run our goal-based optimization, we show that for DAG 1 we can achieve a runtime savings of 37% or a cost savings of 78%. For DAG 2, we can achieve a runtime savings of 45% or a cost savings of 72%. It all depends on what you're trying to achieve.
Fig. 7. Optimizing the schedule for DAGs 1 and 2 to minimize runtime.
Fig. 8. Optimizing the schedule for DAGs 1 and 2 to minimize cloud costs.
Of course, other prioritizations can be implemented as well, or even a mix: for example, one subset of the DAGs needs to hit SLA deadlines, whereas another subset needs to be cost-minimized. From a user's perspective, all the user has to do is set the high-level goals, and AGORA automatically reconfigures and reschedules the cluster to hit all goals simultaneously. The engineer can just sit back and relax.
The Solution at Scale — Alibaba Cluster Trace with 14 million tasks
So what happens if we apply our solution in a real-world large system? Fortunately, Alibaba publishes their cluster traces for academic purposes. The 2018 Alibaba cluster trace includes batch jobs run on 4034 machines over a period of 8 days. There are over 4 million jobs (represented as DAGs) and over 14 million tasks. Each machine has 96 cores and an undisclosed amount of memory.
When we simulated our solution at this scale, we demonstrated a total runtime/cost reduction of 65% across the entire cluster. At the scale of Alibaba, a 65% reduction over 14 million tasks is a massive amount of savings.
Fig. 9. Simulated performance of AGORA on the Alibaba cluster trace
Conclusion
We hope this article illuminates the possibilities in terms of the impact of looking globally across entire data pipelines. Here are the main takeaways:
There are massive gains if you look globally: The gains we show here only get larger as we look at larger systems. Optimizing across low level hardware to high level DAGs reveals a massive opportunity.
The goals are flexible: Although we only show cost and runtime optimizations, the model is incredibly general and can account for reliability, “green” computing, or any other priority you’d like to encode.
The problem is everywhere: In this write-up we focused on Airflow and Apache Spark. In reality, this general resource allocation and scheduling problem is fundamental to computer science itself, extending to other large-scale jobs (machine learning, simulations, high-performance computing), containers, microservices, etc.
At Sync, we built the Autotuner for Apache Spark, but that’s just the tip of the iceberg for us. AGORA is currently being built and tested internally here at Sync with early users. If you’d like a demo, please feel free to reach out to see if we can help you achieve your goals.
We'll follow up this article with more concrete, customer-based use cases to really demonstrate the applicability and benefits of AGORA. Stay tuned!
For newcomers to Cassandra, all the terminology can be a little overwhelming at first. If you have experience with relational databases, some concepts like a "Row" or a "Primary Key" will be familiar. But other terms that seem straightforward can often be a little confusing, especially when paired with some of the visuals you see when learning about Cassandra.
For example, take a look at this screenshot from the DataStax OpsCenter management tool:
Now answer these questions:
How many token rings are there?
How many clusters are there?
How many datacenters are there?
How about this slide that I use when introducing new users to Cassandra:
Yes, I love 80's movie references
How many movie references can you fit in a 40 minute talk? I attempted to find out with my Cassandra Summit 2015 presentation, Relational Scaling and the Temple of Gloom. You'll find this slide and a whole bunch more there.
In case you're wondering, yes, your data actually is "coming to America" in that diagram (pun fully intended). Now answer those same questions:
How many token rings are there?
How many clusters are there?
How many datacenters are there?
Did you answer two for all the questions in both pictures? If you did, you wouldn't be alone, but you also wouldn't be correct either. This is one example where the terminology can get a little confusing. Let's walk through those terms, starting with the last one and working our way backwards.
Datacenters
This is the term that most people are familiar with before starting with Cassandra and if you answered two to the question above then you're not only correct, but well on your way to understanding this concept. In the computing world, we tend to think of a datacenter as a physical place where our computers reside. The same can be true for a datacenter in Cassandra, but it doesn't necessarily have to be true.
In Cassandra, a datacenter is just a logical grouping of nodes. This grouping could be based on the physical location of your nodes. For example, we might have a us-east and a us-west datacenter. But it could also be based on something other than physical location. For example, we might set up a transactional and an analytics datacenter to run different types of workloads, but the nodes for those datacenters might be physically in the same location.
So how does Cassandra know which nodes belong to which datacenter? Well that's a little outside the scope of this post, but the short answer is a component in Cassandra called a Snitch. If you want to dig in more, there's a great explanation of the Snitch on DataStax Academy.
Most of the time when we're looking at visual depictions of Cassandra like in OpsCenter or my slide above, the nodes are being shown grouped by data center.
Clusters
If you answered two for this question, I don't blame you. After all in English, cluster is defined as "a group of things or people that are close together" and in both of those pictures, it sure looks like there are two separate groups of nodes that are close together. But as we just established, most of the time when we see pictures like those we're seeing nodes grouped by datacenter.
In Cassandra a cluster refers to all the nodes across all the datacenters that are peers (i.e. aware of each other). For both of those images, we've got two datacenters where replication is happening between the two of them. So while there are two datacenters, there's only one cluster depicted in both of those images.
Token Rings
That leaves us with the question about the number of token rings. I tried to be specific by asking about token rings instead of just rings. Oftentimes in Cassandra, the term "ring" (by itself) is used interchangeably with "cluster" to refer to all the nodes across all the datacenters. But when we say token ring we're usually referring to a specific concept in Cassandra--data distribution.
If you've been working with Cassandra then you know by now that when you create a table, you choose a Primary Key. Part of that Primary Key (usually the first column or sometimes the first group of columns) is called the Partition Key. For example, take a look at the users table from KillrVideo:
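The table definition is not shown here, but a simplified sketch of such a table (the real KillrVideo schema may differ) looks something like:
CREATE TABLE killrvideo.users (
    userid uuid,
    firstname text,
    lastname text,
    email text,
    created_date timestamp,
    PRIMARY KEY (userid)
);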
Here, the partition key is the userid column. When we insert data into that table, the value for userid is used to determine which nodes in Cassandra will actually store the data. Choosing a Primary Key is important but what does this have to do with token rings?
Well when Cassandra wants to know where to place data, it takes your Partition Key value and runs it through a consistent hashing function. The hash that comes out of this consistent hashing function is sometimes referred to as a token. And in Cassandra, nodes in your cluster own ranges (or buckets) of all the possible tokens.
So for example, let's pretend we have a hashing function that outputs tokens from 0 to 99. The distribution of those tokens across all the nodes in an eight node cluster might look something like this:
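The original figure is not reproduced here, but with 100 possible tokens spread roughly evenly across eight nodes, the ownership might break down something like this: node 1 owns tokens 0-12, node 2 owns 13-24, node 3 owns 25-37, node 4 owns 38-49, node 5 owns 50-62, node 6 owns 63-74, node 7 owns 75-87, and node 8 owns 88-99.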
Great for Illustration, Not for Production
While you'd never use a hashing function with so few tokens like this in production, it's still a great illustration. You'll find this slide with a lot more detailed explanation in the DS201: Foundations of Apache Cassandra course.
Now this is a really simplified example because of the small range of tokens available. In real Cassandra deployments, most people stick with the default Murmur3 partitioner which outputs tokens in the range of -2^63 to 2^63 - 1, but even with the larger range available, the principle is the same.
The total range of available tokens and their distribution around the cluster is often referred to as the token ring in Cassandra. And that range of tokens is distributed around the cluster with each node owning a portion of the token ring. So even if we took our 8 node cluster above and logically grouped the nodes across two datacenters, there would still only be one token ring.
This can be really hard to wrap your brain around, especially when you see pictures like the two above. In those pictures, there are definitely two "rings" in the English sense of the word. But in Cassandra terms there's only one token ring (and only one "ring" if we're using that term interchangeably with "cluster"), even if the nodes are grouped and displayed as two datacenters.
Conclusion
If you're struggling with some of the terminology when getting started with Cassandra, hopefully this helps to clear some things up. I highly recommend checking out DS201: Foundations of Apache Cassandra on DataStax Academy for a deeper dive into many of these concepts. Things like replication (and replication strategies) are all built on top of this foundation, so understanding these concepts can go a long way towards becoming a Cassandra guru.
Note This blog post originally appeared on Planet Cassandra in April 2016 before that site was donated to the Apache Foundation and all the content was removed. I've reposted it here to preserve the content.
DataStax is releasing the first preview of Stargate, a new open source API framework that could eventually turn Apache Cassandra into a multi-model database. It's an approach that has parallels with cloud databases from Microsoft Azure and Google that also take the API approach, and more recently, from household brands like Oracle.
While the project name conjures up memories of David Bowie, the goal of Stargate is exposing Cassandra beyond the existing developer base skilled in CQL (Cassandra Query Language) or Gremlin to JavaScript developers versed in JSON, or Java developers accustomed to working with SQL. Access would be via APIs supporting full CRUD (create-read-update-delete) functions. At the starting gun, it's not surprising that the first (and for now, only) API supports Apache Cassandra with CQL and REST APIs.
Stargate is designed as a gateway that sits apart from the storage engine, running either on-premises or in the cloud. It's based on the familiar coordinator node proxy that determines how Cassandra handles requests. As a multi-master database, any node can act as the coordinator for routing the processing of a query, and the nodes are separated from storage. By utilizing the same proxies that handle CQL requests, Cassandra does not have to be rearchitected to handle other APIs.
The project, which is hosted on GitHub, is available through a standard Apache 2.0 open source license. At this point, DataStax has not announced further plans for Stargate, but it is likely that the community will tackle SQL, JSON Documents (we'd expect a MongoDB style API), GraphQL, and Gremlin. At this point, we don't know how Stargate with its Cassandra API performs compared to what is already natively baked into Apache Cassandra.
Going the API route, Stargate plots a similar path as Azure Cosmos DB, which offers five APIs including SQL, MongoDB wire protocol, Cassandra, table (for key-value), and Gremlin in the same database. (In Cosmos DB, once you pick an API for the data set, you're bound to using it.) There's also a parallel with Google, which uses the same storage engine for Cloud Spanner, which is exposed through a SQL API, and Cloud Firestore, which adheres to a JSON document API.
Conceivably, Stargate could evolve into the preferred mode of access to Cassandra, but that depends on two big "ifs." First, there must be no performance penalty compared to the existing native access approach, and secondly, the project would have to get accepted by the Apache Cassandra community and formally become part of the project.
Multi-model support is not new for DataStax Enterprise, DataStax's commercial distribution of Cassandra. Through an earlier acquisition, the DataStax platform also supported Gremlin, but prior to the DSE 6.8 release, the graph engine wasn't integrated into the core database, and so graph data had to be modeled and ingested separately. With DSE 6.8, graph views could work off the same native CQL API, off the same data ingest. But graph support was only available to DSE customers, and was not part of the core open source platform. If Stargate gets accepted by the Apache Cassandra project, that would be a way for mainstreaming use of Gremlin, and potentially, other APIs on the mother ship.
One of the most unavoidable things in software development is software aging. Technology, standards, patterns, and hardware get older and change fast. One of the developer's tasks is to maintain the software and keep it up to date in order to make it age as slowly as possible. Unfortunately, it's easier said than done. New features and a lack of resources lead to neglect of the software and turn it into a so-called legacy system. Do you know how to save your system from becoming a legacy? Are you aware of the effects of having a legacy system? In this article, I answer these and some more questions.
Legacy system
There are a lot of definitions of legacy systems on the Internet, and it's hard to pick the most fitting one. Wikipedia says:
In computing, a legacy system is an old method, technology, computer system, or application program, “of, relating to, or being a previous or outdated computer system,” yet still in use. Often referencing a system as “legacy” means that it paved the way for the standards that would follow it. This can also imply that the system is out of date or in need of replacement.
I fully agree with this definition and I think that it embraces the main aspects of a legacy system. I have heard it said that a legacy system is one that is not covered, or is poorly covered, by tests. I can't agree with that definition, because test coverage alone does not determine whether a system is legacy. You can write modern software without any tests, but it will probably be harder to maintain, which in turn can, of course, lead to a legacy system.
The legacy system term refers to:
an old method, technology, computer system or application program
a system which consists of outdated parts (patterns, standards, libraries, etc.)
We cannot consider the legacy system as:
a system which is not covered or is poorly covered by tests
As described in the Wikipedia definition, a legacy system is an old method, technology, computer system, or application. The term legacy system will be used in the rest of the article to represent all of the above.
Why is a legacy system so problematic?
A legacy system is hard to maintain and can consume too many resources, which leads to higher costs. Difficulties in maintaining software can make it take longer to deliver new features to a customer and increase the chance of introducing bugs. Of course, this affects different departments in your organization. From the business perspective, higher costs of developing the system result in lower revenues. From the human resources perspective, a legacy system can make it harder to hire new employees (who wants to maintain a legacy system?). And finally, programmers want to develop themselves and gain experience every day. They don't want to maintain a big legacy system until they retire, so there is a huge risk that some of them will give their notice. For the programmers who stay, working on a legacy system can mean less confidence in developing it and deploying it to production.
I think that the aspects I’ve described above are good encouragement to think about protecting your system from aging fast. Continuous refactoring can be a good way to do this.
What if I already have a legacy system?
If you already have a legacy system and you don't want to turn it off, you should think about refactoring or even rewriting the whole system. But wait… rewriting the whole system? Yes. I know that you don't have enough resources to write a new system and continuously maintain the legacy one. I also know that you have a lot of fancy ideas about new features which should be released as soon as possible. But I also know that if you don't rewrite or refactor your system, you'll get into trouble in the future.
Strangler Pattern
All the things I've mentioned above sound horrible, but don't panic. There is a way to rewrite your system gradually and safely, without using too many resources and, very importantly, while still being able to maintain the legacy system. The way I have in mind is called the "Strangler Pattern" (Strangler Application Pattern, or even Strangler Fig Application). This metaphor was first described by Martin Fowler in his article "Strangler Application", later renamed to "Strangler Fig Application" because of its close relation to strangler figs, which is described in Fowler's article.
What is the Strangler Pattern and what is it about?
The Strangler Pattern is a way of gradually rewriting a legacy application to keep it up to date and make it easier to maintain. As Martin Fowler describes it, the goal is to create a new system around the edges of the old one in order to "strangle" the old system. This is an iterative method and can take several years until the old system is gone, but thanks to this you can safely migrate your old system to a modern one.
Figure 1 Strangler Pattern idea
Considering that systems are continuously becoming legacy, we can say that the process shown above is repetitive: we rewrite a legacy system into a modern one, which will itself be legacy in the future. Martin Fowler wrote, "Let's face it, all we are doing is writing tomorrow's legacy software today." This is the essence of present-day software development, and everyone who is responsible for any system should be aware of it.
Figure 2 Strangler Pattern is a repetitive process
How can I start?
The Strangler Pattern in theory only describes how it works and what benefits you can get from using it. In practice, there are a couple of different strategies and approaches, some specific to your programming language, that you can use to rewrite your legacy system into a modern one.
Choose the migration strategy
There are two main strategies listed by Martin Fowler in his article:
Event Interception – a strategy based on the stream of events. If you have an event-driven system that sends important events to the outside world, this strategy is probably the best choice for you. For more information please read Martin’s article.
Asset Capture – a strategy based on identifying key functional parts of your system and gradually rewriting/refactoring them in a modern way in the new system. This strategy can run alongside Event Interception. For more information please read Martin's article.
Find an approach which meets your requirements
There are also several approaches to how you can start the migration between the legacy and the modern system for web apps. Below are some of them:
Load Balancing – an approach usually used on the web server side. The web server decides to which destination a request is routed. There is an interesting article with some use cases, called legacy application strangulation case studies.
Figure 3 An approach using Load Balancing
Front Controller with Legacy Bridge – it consists of running the appropriate script for specified requests in the front controller of the application. The front controller checks which script can handle a given request, runs that script, and waits for the response.
Figure 4 An approach using a Front Controller with Legacy Bridge
Legacy Route Loader – it consists of preparing the application routing so that the relevant controller is run for specified requests. The main difference from the previous approach is that there both applications are fully independent and some common parts are duplicated (e.g. routing), whereas here the legacy system runs inside the modern one, which can also interact with parts of the legacy system.
Figure 5 An approach using a Legacy Route Loader
Note: If you know more approaches to migrating legacy systems using the Strangler Pattern, please share them with others in the comments.
Summary
This article was meant to show you that software aging is a common and recurring problem. You should understand that protecting your system from aging fast is important for you and your company. It also introduced the Strangler Pattern as a tool that can help you deal with a legacy system. Software development is a continuous process, and regular care extends a system's vitality.
In the next article, I’ll show you a use case of Strangler Pattern for migrating the old Symfony 1.4 application to the modern Symfony 4.4.
Any sufficiently old codebase eventually starts to contain legacy code. Architecture, performance, comments, and more begin to degrade the moment after they are written.
Some parts of the codebase last longer than other parts, but inevitably new coding standards emerge to reduce technical debt. Then you have to rework a large application, with zero downtime, making a "new way" of working without breaking anything in your release or development.
The Strangler Fig Pattern is one effective way to solve this problem.
What is a Strangler Fig?
The name Strangler Fig Pattern actually comes from a collection of plants that grow by "strangling" their hosts.
They grow in areas where competition for light is intense, and they have evolved to have their seeds dispersed (normally by birds) to the top of a host tree where they can get sunlight easily.
Their roots grow down around the tree and the seedlings grow upwards to consume all the sunlight they can. This "strangles" the tree, and the fig seedlings can often kill the host tree they landed on.
Here's an image of a Strangler Fig, which I found on Wikipedia.
An image of a strangler fig, where the roots grow down the tree trunk to the soil whilst the seedlings grow up above the tree's canopy. Source.
So how does this apply to software? 🤔
Completely re-writing a large, complex codebase with lots of different interactions, often involving different teams, results in a planning nightmare.
In big complicated brown-field projects like this, going big-bang (where everything is released at once) generally forces you to:
understand every interaction in depth to ensure you won't break anything when you release
have all new bug fixes done both in the new and old codebase as you re-write it
keep both merged and up to date
spend weeks in test
deal with tons of callouts and out of hours support for the new codebase's rollout
To top it all off, it normally ends with developers doing a lot of overtime along with an influx of bugs.
One big difficulty we are trying to remove when we use the Strangler Fig is making whoever is using your software aware of where your new software is now accessible.
When you are rewriting your backend, for example, you might put everything on a new endpoint and kindly ask your users to point to that new endpoint. But then, if something goes wrong, you may have to ask them all to point back to the old one.
You may end up going back and forth between these two endpoints if you have really difficult bugs, which might frustrate your users.
When we use the Strangler Fig pattern we can avoid all the above.
Why Strangle Our Code
The Strangler Fig pattern aims to incrementally re-write small parts of your codebase, until after a few months/years, you have strangled all your old codebase and it can be totally removed.
The rough flow is: add a new part to your system that isn't used yet, switch on the new part of the code – normally with a feature flag so it coexists with the old code – and finally remove the old code.
Benefits of the Strangler Fig pattern
Aside from helping you avoid all the issues we've already discussed, it also:
reduces your risk when you need to update things
starts to immediately give you some benefit piece by piece
allows you to push your changes in small modular pieces, easier for release
ensures zero down time
is generally more agile
makes your rollbacks easier
allows you to spread your development on the codebase over a longer period of time if you have multiple priorities.
There are multiple ways of implementing the Strangler Fig pattern and it depends on the system you're removing, and so on. So let's get concrete and cover an example.
Façade Payment Provider Example
Let's say as an example you have a huge monolithic back-end codebase to handle payments. It is huge. A few million lines of code, with multiple endpoints, that you want to re-write into something new for your company, for a multitude of reasons.
Performance is now poor, the architecture is too confusing to onboard new developers, and there is lots of dead-code you need to remove but don't want to break anything.
Breaking a huge codebase involving customer payments might just cause the unlucky developer who pushed up last to lose their job!
Okay. How do you slowly choke out this old codebase? Even more tricky, you don't want to just put a new endpoint there and force everyone to move. You have hundreds of customers using this software, they can't just flip back and forth between your endpoints if you have bugs and need to rollback.
Just to add one last challenge, you don't want to change your interfaces to these endpoints either. Everything being passed as arguments or returned should remain the same.
The Strangler Fig Pattern-based Solution
We can create a façade that intercepts requests going to the legacy endpoints.
The new façade will hand off to the new API you've written, or hand off to the legacy API if you haven't rewritten that piece of the codebase yet.
This façade is essentially a shim to catch network requests and hand them off to the right place.
You can then gradually migrate across to the new API piece by piece, and your users will be unaware of any changes to your underlying code as you have correctly abstracted it away.
If you are doing this correctly you will generally:
Just have the legacy way at the beginning
Make the new API
Make it coexist with the legacy API, where you can turn it on and off with feature flags
Move more and more across to the new API
Delete the old way when fully migrated
The strangling part is happening piece by piece where you remove more and more responsibility from the legacy API and into the new API.
Conclusion
I hope this has explained what the Strangler Fig Pattern is along with some of its benefits.
I have seen this pattern used in real software projects, and it works really effectively. It was easily one of the most complicated projects I have worked on, and the Strangler Fig pattern made it so much easier.
It stops you from writing code for months, then crossing your fingers and sending it to production whilst hoping you haven't forgotten anything.
Two resources were invaluable while I was writing this:
Strangler Fig Application by Martin Fowler, and
Avoid rewriting a legacy system from scratch, by strangling it.
I share my writing on Twitter if you enjoyed this article and want to see more.
Microservices are a common approach to decomposing a large application into smaller, self-contained, and interconnected applications.
The migration from an existing standalone application, or monolith, to microservices is an architectural challenge that might take months or years. For this reason, it is important to look at best practices and patterns to tackle such a transition in the most effective way.
One of the most important and widely used methodologies for decomposing an application is the Strangler Fig pattern. The Strangler Fig pattern, identified by Martin Fowler [1], takes its name from a fig commonly found in Asia. The fig grows on top of an existing tree and pushes its roots down to reach the ground. Once the fig takes root, it gradually grows around the host tree, stealing the sunlight and the nutrients in the ground, eventually killing the host. When the host tree rots away, an empty structure remains, which is the strangler fig. The figure below shows this process with the host tree depicted in gray and the fig in green.
Strangler Fig lifecycle [2]
Strangler Fig in Hualien
In the context of software, the new system (the fig) is initially supported by the monolith (the host tree) and then gradually replaces it. The key is an incremental migration that can be stopped and rolled back at any moment.
Benefits of the Strangler Fig pattern:
- Incremental migration
- No (or limited) changes to the existing system
- Easy rollback at any moment
This pattern consists of three steps:
1. Identify the feature to extract
2. Implement the feature as a standalone application (microservice)
3. Reroute the requests to the microservice
The steps are highlighted in the figure below.
Feature extraction using the Strangler Fig pattern
To identify the functionality to extract, it is important to have the architecture diagram of the existing system and to highlight the dependencies between the modules.
To facilitate the decomposition, a module (or component) should have a clear bounded context. For example, consider the architecture diagram below:
Architecture diagram of a traditional application
The module Invoicing has no inbound dependencies from other modules, making it easier to extract. The client calls to the Invoicing module can be redirected with a reverse proxy (more details in the last section of the article).
The easiest functionality to extract is not necessarily the best candidate. It’s also important to consider the benefits of the decomposition.
Regarding the data, there are no guarantees about how it is structured, so the database decomposition has to be taken into account as well. However, the database decomposition is a longer and more tedious process that can typically be postponed to a later stage. This is also because a main aspect of the migration to microservices is gaining “quick wins” by showing results incrementally and quickly.
The simplest scenario is when a functionality in the existing application has no dependencies on other modules. In that case, no changes to the monolith are required and it can be treated as a black box. In the figure below, the module Inventory Management is extracted into a microservice.
Inventory Management module extraction
Until the requests are rerouted, the microservice is not in use, so it can be safely deployed to production. Additionally, the migration can be rolled back at any moment.
If the extracted module has outbound dependencies, the monolithic application can expose an API to let the microservice use that dependency. For example, the Payroll feature depicted in the figure below has an outbound dependency on the User Notifications module:
Payroll module extraction
If many functionalities in the monolith use the extracted functionality, this pattern does not work well. Extracting the User Notifications module would require a substantial effort since it’s used by both Payroll and Invoicing.
If the application is accessed over HTTP, the inbound requests can be rerouted using a reverse proxy.
In such case, these steps can be followed:
1. Insert the proxy between upstream requests and the monolith
2. Deploy the microservice
3. Reconfigure the proxy to redirect the calls to the microservice
Regarding step 3, it is important not to remove the existing functionality from the monolith, in case unexpected errors occur and a rollback is needed.
One of the easiest ways to route HTTP calls is based on the URL. For example:
- If URL matches "https://my.app/invoice/*": route the request to the Invoice microservice
- If URL matches "https://my.app/*": route the request to the monolith
Such routing rules are commonly supported by reverse proxies, like NGINX.
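As a sketch, assuming NGINX, the two rules above might be expressed with a configuration along these lines; the upstream addresses and the server name are placeholders.

```nginx
# Illustrative only: upstream addresses and the server name are placeholders.
upstream monolith        { server 10.0.0.10:8080; }
upstream invoice_service { server 10.0.0.20:8080; }

server {
    listen 80;
    server_name my.app;

    # Requests for the extracted feature go to the new microservice.
    location /invoice/ {
        proxy_pass http://invoice_service;
    }

    # Everything else still goes to the monolith.
    location / {
        proxy_pass http://monolith;
    }
}
```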
Additionally, the redirection can be based on specific parameters in the body of the request, but for more complex routing mechanisms the capabilities of the specific reverse proxy need to be considered.
If the existing application uses a different protocol, e.g. SOAP, and the microservice exposes a traditional HTTP REST interface, the reverse proxy can translate the SOAP requests into REST calls.
As a final piece of advice, it is better to keep the reverse proxy simple, to avoid building another complex system that needs to be maintained and updated frequently.
[3] Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith (Sam Newman)
Large objects are a code smell: overloaded with responsibilities and dependencies, as they continue to grow, it becomes more difficult to define what exactly they’re responsible for. Large objects are harder to reuse and slower to test. Even worse, they cost developers additional time and mental effort to understand, increasing the chance of introducing bugs. Unchecked, large objects risk turning the rest of your codebase into a ball of mud, but fear not! There are strategies for reducing the size and responsibilities of large objects. Here’s one that worked for us at Shopify, an all-in-one commerce platform supporting over one million merchants across the globe.
As you can imagine, one of the most critical areas in Shopify’s Ruby on Rails codebase is the Shop model. Shop is a hefty class with well over 3000 lines of code, and its responsibilities are numerous. When Shopify was a smaller company with a smaller codebase, Shop’s purpose was clearer: it represented an online store hosted on our platform. Today, Shopify is far more complex, and the business intentions of the Shop model are murkier. It can be described as a God Object: a class that knows and does too much.
My team, Kernel Architecture Patterns, is responsible for enforcing clean, efficient, scalable architecture in the Shopify codebase. Over the past few years, we invested a huge effort into componentizing Shopify’s monolithic codebase (see Deconstructing the Monolith) with the goal of establishing well-defined boundaries between different domains of the Shopify platform.
Not only is creating boundaries at the component-level important, but establishing boundaries between objects within a component is critical as well. It’s important that the business subdomain modelled by an object is clearly defined. This ensures that classes have clear boundaries and well-defined sets of responsibilities.
Shop’s definition is unclear, and its semantic boundaries are weak. Unfortunately, this makes it an easy target for the addition of new features and complexities. As advocates for clean, well-modelled code, it was evident that the team needed to start addressing the Shop model and move some of its business processes into more appropriate objects or components.
Knowing where to start refactoring can be a challenge, especially with a large class like Shop. One way to find a starting point is to use a code metric tool. It doesn’t really matter which one you choose, as long as it makes sense for your codebase. Our team opted to use Flog, which uses a score based on the number of assignments, branches and calls in each area of the code to understand where code quality is suffering the most. Running Flog identified a particularly disordered portion in Shop: store settings, which contains numerous “global attributes” related to a Shopify store.
Extracting store settings into more appropriate components offered a number of benefits, notably better cohesion and comprehension in Shop and the decoupling of unrelated code from the Shop model. Refactoring Shop was a daunting task—most of these settings were referenced in various places throughout the codebase, often in components that the team was unfamiliar with. We knew we’d potentially make incorrect assumptions about where these settings should be moved to. We wanted to ensure that the extraction process was well laid out, and that any steps taken were easily reversible in case we changed our minds about a modelling decision or made a mistake. Guaranteeing no downtime for Shopify was also a critical requirement, and moving from a legacy system to an entirely new system in one go seemed like a recipe for disaster.
What is the Strangler Fig Pattern?
The solution? Martin Fowler’s Strangler Fig Pattern. Don’t let the name intimidate you! The Strangler Fig Pattern offers an incremental, reliable process for refactoring code. It describes a method whereby a new system slowly grows over top of an old system until the old system is “strangled” and can simply be removed. The great thing about this approach is that changes can be incremental, monitored at all times, and the chances of something breaking unexpectedly are fairly low. The old system remains in place until we’re confident that the new system is operating as expected, and then it’s a simple matter of removing all the legacy code.
That’s a relatively vague description of the Strangler Fig Pattern, so let’s break down the 7-step process we created as we worked to extract settings from the Shop model. The following is a macro-level view of the refactor.
Macro-level view of the Strangler Fig Pattern
We’ll dive into exactly what is involved in each step, so don’t worry if this diagram is a bit overwhelming to begin with.
Step 1: Define an Interface for the Thing That Needs to Be Extracted
Define the public interface by adding methods to an existing class, or by defining a new model entirely
The first step in the refactoring process is to define the public interface for the thing being extracted. This might involve adding methods to an existing class, or it may involve defining a new model entirely. This first step is just about defining the new interface; we’ll depend on the existing interface for reading data during this step. In this example, we’ll be depending on an existing Shop object and will continue to access data from the shops database table.
Let’s look at an example involving Shopify Capital, Shopify’s finance program. Shopify Capital offers cash advances and loans to merchants to help them kick-start their business or pursue their next big goal. When a merchant is approved for financing, a boolean attribute, locked_settings, is set to true on their store. This indicates that certain functionality on the store is locked while the merchant is taking advantage of a capital loan. The locked_settings attribute is being used by the following methods in the Shop class:
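The original snippet is not reproduced here, but the methods in question would look roughly like the following sketch; the method names are hypothetical stand-ins, not Shopify's actual code.

```ruby
# Hypothetical sketch of Shop methods built on the locked_settings column.
class Shop < ApplicationRecord
  def settings_locked?
    locked_settings
  end

  def lock_settings!
    update!(locked_settings: true)
  end

  def unlock_settings!
    update!(locked_settings: false)
  end
end
```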
We already have a pretty clear idea of the methods that need to be involved in the new interface based on the existing methods that are in the Shop class. Let’s define an interface in a new class, SettingsToLock, inside the Capital component.
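A sketch of what that interface might look like at this stage, still backed entirely by the Shop record, is shown below; the constructor signature and method names are assumptions.

```ruby
# Sketch: the new public interface for Capital settings. For now it only
# delegates to the Shop record; no new storage exists yet.
module Capital
  class SettingsToLock
    def initialize(shop_id:)
      @shop_id = shop_id
    end

    def locked?
      shop.locked_settings
    end

    def lock
      shop.update!(locked_settings: true)
    end

    def unlock
      shop.update!(locked_settings: false)
    end

    private

    attr_reader :shop_id

    def shop
      @shop ||= Shop.find(shop_id)
    end
  end
end
```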
As previously mentioned, we’re still reading from and writing to a Shop object at this point. Of course, it’s critical that we supply tests for the new interface as well.
We’ve clearly defined the interface for the new system. Now, clients can start using this new interface to interact with Capital settings rather than going through Shop.
Step 2: Change Calls to the Old System to Use the New System Instead
Replace calls to the existing “host” interface with calls to the new system instead
Now that we have an interface to work with, the next step in the Strangler Fig Pattern is to replace calls to the existing “host” interface with calls to the new system instead. Any objects sending messages to Shop to ask about locked settings will now direct their messages to the methods we’ve defined in Capital::SettingsToLock.
In a controller for the admin section of Shopify, we have the following method:
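The actual controller code is not shown here; a hypothetical stand-in for such an action, reaching into the Shop object directly, might look like this (the action name and current_shop helper are placeholders):

```ruby
# Hypothetical sketch: the controller talks to the Shop object directly.
def lock_shop_settings
  current_shop.lock_settings!
  head :ok
end
```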
This can be changed to:
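Continuing the same hypothetical sketch, the action now goes through the new interface instead:

```ruby
# The same hypothetical action, now using the new Capital interface.
def lock_shop_settings
  Capital::SettingsToLock.new(shop_id: current_shop.id).lock
  head :ok
end
```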
A simple change, but now this controller is making use of the new interface rather than going directly to the Shop object to lock settings.
Step 3: Make a New Data Source for the New System If It Requires Writing
New data source
If data is written as a part of the new interface, it should be written to a more appropriate data source. This might be a new column in an existing table, or may require the creation of a new table entirely.
Continuing on with our existing example, it seems like this data should belong in a new table. There are no existing tables in the Capital component relevant to locked settings, and we’ve created a new class to hold the business logic—these are both clues that we need a new data source.
The shops table currently looks like this in db/schema.rb
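An abridged, illustrative excerpt is shown below; the real schema has far more columns, and only the locked_settings flag matters here.

```ruby
# db/schema.rb (abridged sketch): the Capital-specific flag still lives on shops.
create_table "shops", force: :cascade do |t|
  t.string  "name"
  t.boolean "locked_settings", default: false, null: false
  # ... many unrelated columns omitted ...
end
```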
We create a new table, capital_shop_settings_locks, with a column locked_settings and a reference to a shop.
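A migration along these lines would do it; the column options and the Rails migration version are assumptions.

```ruby
# Sketch of the migration creating the new, Capital-owned table.
class CreateCapitalShopSettingsLocks < ActiveRecord::Migration[6.0]
  def change
    create_table :capital_shop_settings_locks do |t|
      t.references :shop, null: false, index: { unique: true }
      t.boolean :locked_settings, default: false, null: false
      t.timestamps
    end
  end
end
```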
The creation of this new table marks the end of this step.
Step 4: Implement Writers in the New Model to Write to the New Data Source
Implement writers in the new model to write data to the new data source and existing data source
The next step in the Strangler Fig Pattern is a bit more involved. We need to implement writers in the new model to write data to the new data source while also writing to the existing data source.
It’s important to note that while we have a new class, Capital::SettingsToLock, and a new table, capital_shop_settings_locks, these aren’t connected at the moment. The class defining the new interface is a plain old Ruby object and solely houses business logic. We are aiming to create a separation between the business logic of store settings and the persistence (or infrastructure) logic. If you’re certain that your model’s business logic is going to stay small and uncomplicated, feel free to use a single Active Record. However, you may find that starting with a Ruby class separate from your infrastructure is simpler and faster to test and change.
At this point, we introduce a record object at the persistence layer. It will be used by the Capital::SettingsToLock class to read data from and write data to the new table. Note that the record class will effectively be kept private to the business logic class.
We accomplish this by creating a subclass of ApplicationRecord. Its responsibility is to interact with the capital_shop_settings_locks table we’ve defined. We define a class Capital::SettingsToLockRecord, map it to the table we’ve created, and add some validations on the attributes.
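A sketch of that record class follows; the exact validations are assumptions.

```ruby
# Sketch: a plain Active Record model mapped to the new table, kept private
# to the business-logic class.
module Capital
  class SettingsToLockRecord < ApplicationRecord
    self.table_name = "capital_shop_settings_locks"

    validates :shop_id, presence: true, uniqueness: true
    validates :locked_settings, inclusion: { in: [true, false] }
  end
end
```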
Let’s add some tests to ensure that the validations we’ve specified on the record model work as intended:
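For example, in Minitest-style tests matching the assumed validations above (illustrative only):

```ruby
# Illustrative tests for the assumed validations on the record model.
module Capital
  class SettingsToLockRecordTest < ActiveSupport::TestCase
    test "is invalid without a shop_id" do
      refute SettingsToLockRecord.new(locked_settings: true).valid?
    end

    test "is invalid when locked_settings is nil" do
      refute SettingsToLockRecord.new(shop_id: 1, locked_settings: nil).valid?
    end
  end
end
```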
Now that we have Capital::SettingsToLockRecord to read from and write to the table, we need to set up Capital::SettingsToLock to access the new data source via this record class. We can start by modifying the constructor to take a repository parameter that defaults to the record class:
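In sketch form, with the keyword arguments as assumptions:

```ruby
# Sketch: the repository collaborator defaults to the record class, so tests
# can substitute a double without touching the database.
module Capital
  class SettingsToLock
    def initialize(shop_id:, repository: SettingsToLockRecord)
      @shop_id = shop_id
      @repository = repository
    end
  end
end
```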
Next, let’s define a private getter, record. It performs find_or_initialize_by on the record model, Capital::SettingsToLockRecord, using shop_id as an argument to return an object for the specified shop.
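A sketch of that getter, assuming the shop_id and repository readers from the constructor above:

```ruby
# Sketch: lazily find or build the settings row for this shop.
module Capital
  class SettingsToLock
    private

    attr_reader :shop_id, :repository

    def record
      @record ||= repository.find_or_initialize_by(shop_id: shop_id)
    end
  end
end
```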
Now, we complete this step in the Strangler Fig Pattern by starting to write to the new table. Since we're still reading data from the original data source, we'll need to write to both sources in tandem until the new data source is written to and has been backfilled with the existing data. To ensure that the two data sources are always in sync, we'll perform the writes within transactions. Let's refresh our memories on the methods in Capital::SettingsToLock that are currently performing writes.
After duplicating the writes and wrapping these double writes in transactions, we have the following:
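In sketch form, assuming the shop and record getters from the earlier snippets:

```ruby
# Sketch: write to both data sources inside a transaction so the old and new
# tables cannot drift apart while both are live.
def lock
  ActiveRecord::Base.transaction do
    shop.update!(locked_settings: true)
    record.update!(locked_settings: true)
  end
end

def unlock
  ActiveRecord::Base.transaction do
    shop.update!(locked_settings: false)
    record.update!(locked_settings: false)
  end
end
```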
The last thing to do is to add tests that ensure that lock and unlock are indeed persisting data to the new table. We control the output of SettingsToLockRecord’s find_or_initialize_by, stubbing the method call to return a mock record.
At this point, we are successfully writing to both sources. That concludes the work for this step.
Step 5: Backfill the New Data Source with Existing Data
Backfill the data
The next step in the Strangler Fig Pattern involves backfilling data to the new data source from the old data source. While we’re writing new data to the new table, we need to ensure that all of the existing data in the shops table for locked_settings is ported over to capital_shop_settings_locks.
In order to backfill data to the new table, we'll need a job that iterates over all shops and creates record objects from the data on each one. Shopify developed an open-source iteration API as an extension to Active Job. It offers safer iterations over collections of objects and is ideal for a scenario like this. There are two key methods in the iteration API: build_enumerator specifies the collection of items to be iterated over, and each_iteration defines the actions to be carried out on each object in the collection. In the backfill task, we specify that we'd like to iterate over every shop record, and each_iteration contains the logic for creating or updating a Capital::SettingsToLockRecord object given a store. The alternative is to make use of Rails' Active Job framework and write a simple job that iterates over the Shop collection.
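A sketch of such a backfill job, assuming the open-source job-iteration gem; the class name and log message are illustrative.

```ruby
# Sketch of the backfill: iterate over every shop and create or update the
# corresponding settings record, logging any persistence failure.
class BackfillCapitalSettingsLocksJob < ActiveJob::Base
  include JobIteration::Iteration

  def build_enumerator(cursor:)
    enumerator_builder.active_record_on_records(Shop.all, cursor: cursor)
  end

  def each_iteration(shop)
    shop.with_lock do # pessimistic lock on the shop row during the update
      record = Capital::SettingsToLockRecord.find_or_initialize_by(shop_id: shop.id)
      unless record.update(locked_settings: shop.locked_settings)
        Rails.logger.warn(
          "Backfill failed for shop #{shop.id}: #{record.errors.full_messages.join(', ')}"
        )
      end
    end
  end
end
```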
Some comments about the backfill task: the first is that we’re placing a pessimistic lock on the Shop object prior to updating the settings record object. This is done to ensure data consistency across the old and new tables in a scenario where a double write occurs at the same time as a row update in the backfill task. The second thing to note is the use of a logger to output information in the case of a persistence failure when updating the settings record object. Logging is extremely helpful in pinpointing the cause of persistence failures in a backfill task such as this one, should they occur.
We include some tests for the job as well. The first tests the happy path and ensures that we're creating and updating settings records for every Shop object. The other tests the unhappy path in which a settings record update fails and ensures that the appropriate logs are generated.
After writing the backfill task, we enqueue it via a Rails migration:
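For instance, with the migration name and Rails version as assumptions:

```ruby
# Sketch: kick off the backfill as part of a deploy.
class EnqueueCapitalSettingsLocksBackfill < ActiveRecord::Migration[6.0]
  def up
    BackfillCapitalSettingsLocksJob.perform_later
  end

  def down
    # The backfill is not reversible; nothing to do here.
  end
end
```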
Once the task has run successfully, we celebrate that the old and new data sources are in sync. It’s wise to compare the data from both tables to ensure that the two data sources are indeed in sync and that the backfill hasn’t failed anywhere.
Step 6: Change the Methods in the Newly Defined Interface to Read Data from the New Source
Change the reader methods to use the new data source
The remaining steps of the Strangler Fig Pattern are fairly straightforward. Now that we have a new data source that is up to date with the old data source and is being written to reliably, we can change the reader methods in the business logic class to use the new data source via the record object. With our existing example, we only have one reader method:
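In the running sketch, that reader still points at the old data source:

```ruby
# Sketch: the lone reader, still backed by the shops table.
def locked?
  shop.locked_settings
end
```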
It’s as simple as changing this method to go through the record object to access locked_settings:
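Continuing the sketch:

```ruby
# Sketch: the same reader, now backed by the new table via the record object.
def locked?
  record.locked_settings
end
```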
Step 7: Stop Writing to the Old Source and Delete Legacy Code
Remove the now-unused, “strangled” code from the codebase
We’ve made it to the final step in our code strangling! At this point, all objects are accessing locked_settings through the Capital::SettingsToLock interface, and this interface is reading from and writing to the new data source via the Capital::SettingsToLockRecord model. The only thing left to do is remove the now-unused, “strangled” code from the codebase.
In Capital::SettingsToLock, we remove the writes to the old data source in lock and unlock and get rid of the getter for shop. After the changes, the class looks roughly like the sketch below.
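Again in sketch form, consistent with the earlier snippets rather than with Shopify's actual code:

```ruby
# Sketch of the end state: all reads and writes go through the record class;
# the Shop object is no longer involved.
module Capital
  class SettingsToLock
    def initialize(shop_id:, repository: SettingsToLockRecord)
      @shop_id = shop_id
      @repository = repository
    end

    def locked?
      record.locked_settings
    end

    def lock
      record.update!(locked_settings: true)
    end

    def unlock
      record.update!(locked_settings: false)
    end

    private

    attr_reader :shop_id, :repository

    def record
      @record ||= repository.find_or_initialize_by(shop_id: shop_id)
    end
  end
end
```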
We can remove the tests in Capital::SettingsToLockTest that assert that lock and unlock write to the shops table as well.
Last but not least, we remove the old code from the Shop model, and drop the column from the shops table.
With that, we’ve successfully extracted a store settings column from the Shop model using the Strangler Fig Pattern! The new system is in place, and all remnants of the old system are gone.
In summary, we’ve followed a clear 7-step process known as the Strangler Fig Pattern to extract a portion of business logic and data from one model and move it into another:
We defined the interface for the new system.
We incrementally replaced reads to the old system with reads to the new interface.
We defined a new table to hold the data and created a record for the business logic model to use to interface with the database.
We began writing to the new data source from the new system.
We backfilled the new data source with existing data from the old data source.
We changed the readers in the new business logic model to read data from the new table.
Finally, we stopped writing to the old data source and deleted the remaining legacy code.
The appeal of the Strangler Fig Pattern is evident. It reduces the complexity of the refactoring journey by offering an incremental, well-defined execution plan for replacing a legacy system with new code. This incremental migration to a new system allows for constant monitoring and minimizes the chances of something breaking mid-process. With each step, developers can confidently move towards a refactored architecture while ensuring that the application is still up and tests are green. We encourage you to try out the Strangler Fig Pattern with a small system that already has good test coverage in place. Best of luck in future code-strangling endeavors!