StitchData: ETL Made Simple

StitchData: ETL Made Simple

StitchData is a no-code platform that simplifies the ETL process. In this post we are going to cover setting up StitchData for the first time and getting your first extract, transform, and load process running smoothly.

https://www.stitchdata.com/

In this example, we are focusing on moving data from a Google Sheet over to Google BigQuery. This is a great example of how a business might be scaling up and starting to expand beyond the capabilities of spreadsheets and into databases. If you are unfamiliar with BigQuery, it is a cloud-based data warehouse that utilizes SQL. Databases like this have quite a few advantages over traditional spreadsheets. To name a few, you can leverage far larger datasets, they enforce stricter organization and structure, reduce the risk of becoming compromised or creating accidental errors, and allow for greater collaboration with more accessibility.

StitchData allows just about anyone to move data because it requires no coding and has straightforward instructions. You can follow along with the walkthrough in this post by using the free trial found on their website. All of the site links will be provided in the steps associated with them.

Before You Start

Before you begin, you need to make sure you have a game plan. Step one is to go through your sheet and make sure the data is clean and ready for the ETL process. Cleaning data will involve looking through your dataset and properly formatting data types, removing duplicates, correcting incomplete or inconsistent data, and generally smoothing out the dataset. The data I’m working with can be found at this link.

https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

  • Make sure you have a header in the first row that has unique values.
    • Note that capitalization does not make a cell unique from another.
  • The second row of your data must be full.
    • If it is not full, StitchData will return an empty or malformed sheet error.
    • Avoid NULLS as StitchData will not be able to read them.
  • Ensure you have the same data types down each of your columns.

Now that your data is set up, we need to decide on what destination we are going to send it to. Again, for this case, we will be sending it to Google BigQuery.

Setup Your Integration

It’s time to jump into StitchData and set up your account.

https://www.stitchdata.com/

  • Once you are logged in, start by adding an Integration and selecting Google Sheets.
    • This is the source of your data, so scroll through and find the application that applies to your use case.
StitchData: Start by Adding an Integration

For the configuration of Google Sheets, all you need to do is decide on a name that will be transferred over to BigQuery, link your sheet ID, set your historical data sync date, set a replication frequency, and choose what tables/columns you want. There are details about these choices included in the next steps. 

  • Name your integration.
    • It is important to think about how your database naming conventions are set up and pick something that aligns with that structure.
  • Link your spreadsheet ID.
    • This will be unique to you, so view the picture below to see where to find it.
Google Sheets: Sheet ID
  • Leave historical sync date as default in this walkthrough as we are just using a static spreadsheet.
    • If you are using a different tool, Salesforce, for example, you might want to adjust the date to avoid an overload of data, so you only pull the specific data you are wanting to work with.
  • Adjust replication frequency to fit whatever case fits your application. Leave this as the default as well.
    • If your dataset is constantly updating with fresh data and you are building live visualized dashboards, you might want to set it as frequently as it can go.
    • If you want all your data to be pulled at the end of the day so your analysts can work with it first thing in the morning, you can set it to extract overnight.

At this point, your integration should look similar to this.

StitchData: Configure Your Google Sheets Integration

StitchData will now connect to your source, and if successful, you will be moved to a screen where you can select what data you want to work with. StitchData is highly useful in this aspect because you can select exactly what you are wanting to put into a table at your destination.

StitchData: Choose Your Data

If you need further assistance, you can view this guide. https://www.stitchdata.com/docs/integrations/saas/google-sheets

Your integration is now complete so it’s time to configure the warehouse, Google BigQuery.

Configure Your Warehouse

Now you can select the destination you want to use. For the demo, we are going to select BigQuery. Within the form that loads you will see there are a couple of steps in BigQuery that need to be completed for StitchData to load properly.

  • Move over to Google Cloud Platform (GCP) and create a new account/sign in. https://cloud.google.com/gcp
  • Make a new project that you want your data to be loaded into.

At this point, it might seem like you’re good to go, however, for the load to run properly you need to set up a couple of permissions to give administrator rights.

StitchData: You Will Need the Following
  • On the left side navigation bar click on IAM & Admin followed by Service Accounts and create a new one.
    • The naming of this account doesn’t matter, so name it how you see fit. 
  • Assign the account the role of BigQuery Admin and Storage Admin at a minimum.
    • Without the admin roles set, BigQuery will not be able to load data into the database.
    • Optionally, assign Owner if you are the only one using the project.
GCP: Service Account Admin Permissions

The service account is now mostly set up, so click done.

  • Within the service, account click over to keys and create a new key. Select JSON and save the file it creates.
    • This key holds the GCP project information that StitchData needs to be able to connect.
GCP: Key

We can move back over to StitchData and in the connection window, fill out the information, upload your key, and click done.

If you need further assistance, you can view this guide.

https://www.stitchdata.com/docs/destinations/google-bigquery/v2/connecting-google-bigquery-to-stitch

Your destination configuration is now complete, so let’s run the first extract and load.

Run Your Integration

Everything is set up and it’s time to see if it works. 

  • From the landing page of your complete integration click over to Extractions and Run Extraction Now.

This page will show you everything you need to know about the extraction process, when they were run, and the logs to tell you if it was successful or any errors that were encountered. Depending on the size of your data it might take a little while for it to complete. Once it finishes, the in-progress notation will disappear.

StitchData: Extraction
  • Move back over to summary and you will see if the extraction was successful as well as when the load was completed.
    • This might take a while to populate so feel free to refresh the page a couple of times until it does.
StitchData: Summary

Finally, we can move over to GCP one last time and confirm that it was a success. 

  • In the dropdown menu on the left select BigQuery and SQL Workspace. 
    • Open the database dropdowns and you should see your project as well as a couple of other tables.

For now, we can just focus on the sheet you want to work with and run a test query on it. 

  • Run a query and comb through the data and validate that it is correct.
    • If there are errors, you can jump into those tables and query them to find anything that might have been incompatible or caused problems.
Google BigQuery: Query Your Data

If you have confirmed that your data is correct, you are good to go and your integration is complete!

Conclusion

Congratulations, you have gotten started with your first StitchData ETL integration! Now you can take this concept and apply it to whatever business case you might have. All in all, StitchData is a very intuitive and powerful tool so there’s not much to complain about. However, I’ll touch on a few key points I found noteworthy.

  • Pros
    • There is a huge library of integrations to choose from.
    • They give you the option to suggest an integration if it isn’t listed.
    • Easy to read documentation is provided.
    • The pricing is competitive for the functionality it produces if you’re a business moving large amounts of data.
  • Cons
    • Sometimes the integration will freeze up and you will have to refresh or deactivate/reactive to get it to run.
    • There are no true transformation options. You can only select what tables and columns you want to work with. It would be nice if you could create new columns or aggregates during this stage.
    • As an individual user, pricing can be rather steep.

In addition to StitchData, here are some competitors you might want to check out. Coupler.io, Fivetran, Hevo Data, and Integrate.io. Compare their features and find out what is going to work best for you and your organization!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!


Join Anant's Newsletter

Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!