In Apache Cassandra Lunch #43: DSBulk with sed and awk, we discuss how we can use the DataStax Bulk Loader with sed and awk for Cassandra data operations. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #43: DSBulk with sed and awk, we discuss how we can use the DataStax Bulk Loader with sed and awk for Cassandra data operations. We have a walkthrough to show you how you can do ETL from different Cassandra instances using DSBulk (E + L) with sed and awk (T). The live recording embedded below contains a live demo as well, so be sure to watch that as well!
Also, check out this blog for a series that will outline the general approaches for data operations in business-critical environments that leverage Cassandra and must maintain high-availability in an agile development and continuous delivery environment.
If you are not familiar with the DataStax Bulk Loader, it is an open-source tool that can be used to load and unload CSV or JSON data in and out of supported databases. The tool can be used with DataStax Astra, DSE, and open-source Apache Cassandra. More on the tool can be found here, and you can also check out the GitHub repository for additional information.
We can now move onto the walkthrough portion of the blog. First, you will need to go this GitHub repository. You can follow the instructions on the repository, or follow the instructions on this blog, but they will be the same. In this walkthrough, we will be using dsbulk to unload data from a DataStax Astra instance, do data transformations using awk and sed, and then load it into a Dockerized Apache Cassandra instance. You can do this walkthrough with any combination of 2 Cassandra distributions mentioned above. If you are working with deployed instances, you can use their contact points and more on that can be found here.
We will introduce the scenario that this example follows: Someone on our team wants us to take the previous_employees_by_title table and create a new table that has the time worked in days instead of a start time and end time. This can be done within the same instance, but for the purposes of showing more of what dsbulk can do, we opted for moving data between DataStax Astra and a Dockerized instance of Apache Cassandra.
cd
into itgit clone https://github.com/Anant/example-cassandra-dsbulk-with-sed-and-awk.git
cd example-cassandra-dsbulk-with-sed-and-awk
Create Database
button on the dashboardGet Started
button on the dashboardThis will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.
test
create database
and wait a couple minutes for it to spin up and become active
.dashboard/<your-db-name>
, click the Settings
menu tab.Admin User
for role and hit generate token.Secure Bundle
Connect
tab in the menuNode.js
(doesn’t matter which option under Connect using a driver
)Secure Bundle
Secure Bundle
into the cloned directory.Upload Data
buttonprevious_employees_by_title.csv
file into the section.Next
buttonprevious_employees_by_title
employee_id
from text to uuidfirst_day
and last_day
to timestampjob_title
as the Partition Keyemployee_id
as the Clustering ColumnNext
buttontest
as the target keyspaceNext
button to begin loading the csv.docker run --name cassandra -p 9042:9042 -d -v "$(pwd)":/example-cassandra-dsbulk-with-sed-and-awk cassandra:latest
docker exec -it cassandra cqlsh
source '/example-cassandra-dsbulk-with-sed-and-awk/days_worked_by_previous_employees_by_job_title.cql'
We will cover 2 methods for loading data into Dockerized Apache Cassandra after unloading it from DataStax Astra. The first method breaks up the unloading and loading steps into 2 seperate methods. The second method does everything in one pipe.
In this command, we are unloading from DataStax Astra, running an awk script, doing a sed transformation, and then writing the output to a CSV. The awk script cleans up the timestamp format of the first_day
and last_day
columns so that we can use the mktime()
function to calculate the time in number of seconds. Then, the script prints out the first 3 columns and then the calculated duration of time worked in days per row. We also have some conditionals saying if that the value returned is negative, then make it positive. This occurred because we used a CSV generator and calculated random datetimes for those 2 columns for ~100,000 rows. Then, we use sed to fix the header that awk spits out with what we want for the destination table and write the output to a new CSV called days_worked_by_previous_employees_by_job_title.csv
.
NOTE: Input your specific variables for the placeholders in the command
dsbulk unload -k test -t previous_employees_by_title -b "secure-connect-<db>.zip" -u <Client ID> -p <Client Secret> | gawk -F, -f duration_calc.awk | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/' > days_worked_by_previous_employees_by_job_title.csv
dsbulk load -url days_worked_by_previous_employees_by_job_title.csv -k test -t days_worked_by_previous_employees_by_job_title
select count(*) from test.days_worked_by_previous_employees_by_job_title ;
truncate table test.days_worked_by_previous_employees_by_job_title ;
dsbulk unload -k test -t previous_employees_by_title -b "secure-connect-<db>.zip" -u <Client ID> -p <Client Secret> | gawk -F, -f duration_calc.awk | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/' | dsbulk load -k test -t days_worked_by_previous_employees_by_job_title
select count(*) from test.days_worked_by_previous_employees_by_job_title ;
And that wraps up how we can quickly do some Cassandra data operations using dsbulk with sed and awk. Again, if you want to watch this walkthrough live, we have a demonstration in the live recording of Apache Cassandra Lunch #43: DSBulk with sed and awk embedded below.
If you missed last week’s Apache Cassandra Lunch #42: SSTable Files with SSTableloader, be sure to check that out as well! If you want to attend Cassandra Lunch live every Wednesday at 12 PM EST, then you can register here now! Additionally, the playlist with all the previously recorded Cassandra Lunches is available here.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!