In Data Engineer’s Lunch #19: Introduction to jq for Data Engineering, we introduce jq and discuss how we can use it for data engineering. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
jq is a command-line tool, like sed but for JSON data: it can be used to slice, filter, map, and transform structured data. If you missed our talks on sed and/or awk, you can find them on youtube.com/anantcorp.
jq is written in C and has no runtime dependencies, which makes it fast and easy to install. Typical use cases for jq include parsing and manipulating JSON data: extracting a particular field of an object, converting a number to a string, and various other standard transformations.
jq works by taking an input and parsing it as a sequence of whitespace-separated JSON values, which are then passed through the provided filter(s) one at a time. The output of the filter(s) is written to STDOUT as a sequence of whitespace-separated JSON values. jq has in-depth documentation with examples, so it is well worth checking out. The documentation also includes a fairly large list of built-in operators and functions, which goes to show how robust the tool is.
The basic syntax for a jq command is as follows:
jq -options 'program' input-file
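For example, assuming jq is installed, here are a few invocations with inline input and two of the most common options, -r (raw string output) and -c (compact one-line output):

```shell
# No options: the identity filter pretty-prints the input JSON.
echo '{"name": "jq", "fast": true}' | jq '.'

# -r prints raw strings instead of JSON-quoted strings.
echo '{"name": "jq"}' | jq -r '.name'              # prints: jq

# -c prints each output value on a single compact line.
echo '{"name": "jq", "fast": true}' | jq -c '.'    # prints: {"name":"jq","fast":true}
```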
Now, we will move onto the walkthrough portion. If you want to watch a live recording of the walkthrough, it is embedded below. We use this GitHub repository for the walkthrough, or you can follow along below.
We recommend going through this walkthrough in Gitpod as Gitpod will have everything we need for this walkthrough. Hit the button below to get started!
If you prefer to do this locally, then you may need to download the latest version of jq if you do not have it on your local OS.
The simplest jq program is the expression '.', which takes the input and outputs it as is.
1.1.1 – Print using input file
jq '.' cars.json
1.1.2 – Print using output of pipe.
cat cars.json | jq '.'
jq 'length' cars.json
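length is polymorphic; as a self-contained illustration (not tied to cars.json), it counts the elements of an array, the keys of an object, and the characters of a string:

```shell
echo '[1, 2, 3]' | jq 'length'           # prints: 3
echo '{"a": 1, "b": 2}' | jq 'length'    # prints: 2
echo '"hello"' | jq 'length'             # prints: 5
```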
jq '.' cars.json
jq '., .' cars.json
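The comma runs two filters on the same input and concatenates their outputs, so '., .' simply prints the input twice. A smaller illustration with made-up data:

```shell
# Each filter in a comma-separated list produces its own output.
echo '{"make": "Buick", "year": 1999}' | jq '.make, .year'
# prints:
# "Buick"
# 1999
```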
The | operator combines two filters by feeding the output(s) of the filter on its left into the input of the filter on its right. It works much like the Unix shell’s pipe, if you are used to that.
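A quick illustration with made-up data; the second form is the usual shorthand for piping one field access into another:

```shell
echo '{"car": {"make": "Buick"}}' | jq '.car | .make'   # prints: "Buick"
echo '{"car": {"make": "Buick"}}' | jq '.car.make'      # prints: "Buick"
```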
jq '. | keys' cars.json
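keys returns an object's key names as an alphabetically sorted array (keys_unsorted keeps insertion order instead). For example, with made-up data:

```shell
# -c prints the resulting array on one line; note the sorted order.
echo '{"id": 1, "car": {"make": "Buick"}}' | jq -c 'keys'   # prints: ["car","id"]
```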
jq '.. | .id? // empty' cars.json
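Recursive descent (..) visits every value in the document; piping it into .id? collects every id field at any depth, with the ? suppressing errors on non-objects and // empty dropping nulls from objects that have no id. With made-up nested data:

```shell
echo '{"id": 1, "nested": {"id": 2, "deeper": {"id": 3}}}' \
  | jq '.. | .id? // empty'
# prints:
# 1
# 2
# 3
```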
3.3.1 – Top level
jq '. | .id, .location' cars.json
3.3.2 – Nested
jq '.car | .make, .model, .year' cars.json
jq '.location = "Virginia Beach"' cars.json
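jq has both plain assignment (=), which sets a path to a fixed value, and update assignment (|=), which pipes the current value through the filter on the right. With made-up data:

```shell
# = replaces the field with a constant value.
echo '{"location": "Dallas"}' | jq -c '.location = "Virginia Beach"'
# prints: {"location":"Virginia Beach"}

# |= computes the new value from the old one.
echo '{"year": 1999}' | jq -c '.year |= . + 1'
# prints: {"year":2000}
```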
jq '.car | select(.make == "Buick")' cars.json
jq '.car | select(.year > 2000)' cars.json
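select(f) passes its input through unchanged when f is truthy and produces nothing otherwise, which makes it the workhorse for filtering a stream of values. With made-up data:

```shell
printf '%s\n' '{"make": "Buick", "year": 1999}' '{"make": "Honda", "year": 2005}' \
  | jq -c 'select(.year > 2000)'
# prints: {"make":"Honda","year":2005}
```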
jq 'del(.id)' cars.json
jq 'del(.car.year) | del(.car.make)' cars.json
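del(path) removes the value at the given path, and it accepts several paths at once, so two deletions can also be written as one. With made-up data:

```shell
echo '{"id": 1, "car": {"make": "Buick", "year": 1999}}' \
  | jq -c 'del(.id, .car.year)'
# prints: {"car":{"make":"Buick"}}
```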
The next part of the walkthrough will focus on a scenario where someone could use jq for data engineering/wrangling in a “real world” setting. At the time of creating this repo, the 2021 MLB season had just started up. Nowadays, data is heavily ingrained in professional sports, and in baseball especially with Sabermetrics (a list of just the offensive statistics). Moving forward, we will use jq to do some data engineering/wrangling and give our analysts some basic statistics to work with.
curl 'http://lookup-service-prod.mlb.com/json/named.roster_40.bam?team_id=%27143%27' | jq '.roster_40.queryResults.row' > phillies.json
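The lookup service wraps its results in an envelope, and the jq filter simply unwraps the array at .roster_40.queryResults.row. The payload below is a trimmed, hypothetical mock of that envelope (the real response carries many more fields per player):

```shell
# Mock of the response envelope; the filter extracts just the roster rows.
echo '{"roster_40": {"queryResults": {"totalSize": "2", "row": [{"name_display_first_last": "Aaron Nola", "position_txt": "P"}, {"name_display_first_last": "J.T. Realmuto", "position_txt": "C"}]}}}' \
  | jq -c '.roster_40.queryResults.row'
```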
We could also sort or group the data; we do not use these operations moving forward, but here is what they could look like.
7.2.1 – Sort by position. sort_by(foo) compares two elements by comparing the result of foo on each element.
jq 'sort_by(.position_txt)' phillies.json
7.2.2 – Group by position. group_by(.foo) takes as input an array, groups the elements having the same .foo field into separate arrays, and produces all of these arrays as elements of a larger array, sorted by the value of the .foo field.
jq 'group_by(.position_txt)' phillies.json
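A self-contained illustration of both on a made-up three-player roster; the map at the end just summarizes each group as a position and a count:

```shell
roster='[{"name": "Nola", "position_txt": "P"},
         {"name": "Realmuto", "position_txt": "C"},
         {"name": "Wheeler", "position_txt": "P"}]'

# sort_by orders the array by the result of the filter on each element.
echo "$roster" | jq -c 'sort_by(.position_txt)'

# group_by buckets elements sharing the same field into sub-arrays.
echo "$roster" | jq -c 'group_by(.position_txt)
  | map({position: .[0].position_txt, count: length})'
# prints: [{"position":"C","count":1},{"position":"P","count":2}]
```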
-s: Read the entire input stream into a large array
jq '.[] | select(.position_txt == "P")' phillies.json | jq -s '.' > pitchers.json
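The first jq emits the matching objects as a whitespace-separated stream, and the second jq's -s slurps that stream back into one JSON array. The same pattern on a made-up two-row stream:

```shell
printf '%s\n' '{"name": "Nola", "position_txt": "P"}' '{"name": "Realmuto", "position_txt": "C"}' \
  | jq -c 'select(.position_txt == "P")' \
  | jq -s -c '.'
# prints: [{"name":"Nola","position_txt":"P"}]
```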
We then use the script file toCSV.jq to convert the JSON to CSV:
jq -r -f toCSV.jq pitchers.json > pitchers.csv
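The actual toCSV.jq lives in the walkthrough repository; purely as a sketch of what such a filter might contain (an assumption, not the repo's exact code, with made-up stats), one common pattern emits a header row built from the first object's keys and then one CSV row per object. Here the objects arrive as a stream, so -s slurps them into an array first; with pitchers.json, which is already an array, -s would be unnecessary:

```shell
# Hypothetical stand-in for toCSV.jq: header from the first object's keys,
# then each object's values in the same column order. -r emits raw CSV text.
printf '%s\n' '{"name": "Nola", "era": 3.28}' '{"name": "Wheeler", "era": 3.44}' \
  | jq -r -s '(.[0] | keys_unsorted) as $cols
      | $cols, (.[] | [.[$cols[]]])
      | @csv'
# prints:
# "name","era"
# "Nola",3.28
# "Wheeler",3.44
```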
7.5.1 – Aaron Nola (SP).
curl 'http://lookup-service-prod.mlb.com/json/named.sport_career_pitching.bam?league_list_id=%27mlb%27&game_type=%27R%27&player_id=%27605400%27' | jq '.sport_career_pitching.queryResults.row' > nola.json
7.5.2 – Zack Wheeler (SP2).
curl 'http://lookup-service-prod.mlb.com/json/named.sport_career_pitching.bam?league_list_id=%27mlb%27&game_type=%27R%27&player_id=%27554430%27' | jq '.sport_career_pitching.queryResults.row' > wheeler.json
7.5.3 – Zach Eflin (SP3).
curl 'http://lookup-service-prod.mlb.com/json/named.sport_career_pitching.bam?league_list_id=%27mlb%27&game_type=%27R%27&player_id=%27621107%27' | jq '.sport_career_pitching.queryResults.row' > eflin.json
7.6.1 – Aaron Nola (SP).
jq '.full_name = "Aaron Nola"' nola.json > nola.json.tmp && mv nola.json.tmp nola.json
7.6.2 – Zack Wheeler (SP2).
jq '.full_name = "Zack Wheeler"' wheeler.json > wheeler.json.tmp && mv wheeler.json.tmp wheeler.json
7.6.3 – Zach Eflin (SP3).
jq '.full_name = "Zach Eflin"' eflin.json > eflin.json.tmp && mv eflin.json.tmp eflin.json
jq -s '.' nola.json eflin.json wheeler.json > rotation.json
jq -r -f toCSV.jq rotation.json > rotation.csv
We can then hand this CSV back to our analytics team, who can use it to game-plan for a series with or against these pitchers. With that, we will wrap up our walkthrough of basic jq operations, as well as a potential real-world scenario in which a tool like jq enables fast data engineering/wrangling.
Again, if you want a more in-depth discussion and live walkthrough, be sure to check out the embedded YouTube video below. If you missed last week’s Data Engineer’s Lunch #18: Luigi for Scheduling, be sure to check that out as well.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link is not only to fill the gap left by Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!