In case you missed it, last week we held the first of our weekly Data Engineer's Lunch discussions on Monday, November 16th. This first lunch covered the Data Engineering Roadmap: how to become a proficient data engineer and the various skills that are required. The live recording of the Data Engineer's Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend a Data Engineer's Lunch live, it is hosted every Monday at 12 PM EST. Register here now!
In our first Data Engineer's Lunch, we discussed the data engineering roadmap. We started with a general path that covers the fields related to data engineering, and then looked at some specific guides on becoming a data engineer.
Starting with our own exploration of data engineering skills, we went through a number of categories, naming the skills a good data engineer needs to master. For programming languages, we listed Python, Scala, and Java. For scripting and automation, we covered shell scripting, cron scheduling, and command-line utilities like sed, awk, and jq.
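To give a flavor of the kind of text-wrangling these command-line utilities handle, here is a small Python sketch that mimics a jq-style query such as `.user.name`: it walks a dotted path through a JSON document. This is a toy illustration using the standard library, not jq itself; the sample record and field names are made up.

```python
import json

def extract(doc: str, path: str):
    """Walk a dotted path (e.g. "user.name") through a JSON document,
    similar in spirit to the jq filter '.user.name'."""
    value = json.loads(doc)
    for key in path.split("."):
        value = value[key]
    return value

# Example: pull one field out of an API-style JSON payload.
record = '{"user": {"name": "ada", "id": 7}, "active": true}'
print(extract(record, "user.name"))  # prints "ada"
```

In practice a one-liner like `jq '.user.name' file.json` does the same job in a shell pipeline, which is why these utilities sit alongside shell scripting and cron on the roadmap.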
Next come databases, split into SQL and NoSQL types and linked with related technologies like data lakes and data warehouses. We also covered data modeling here for structured, semi-structured, and unstructured data. Data processing skills include more shell scripting with sed/awk, along with Python and Perl. They also include batch processing using Spring Data/Batch and Apache Spark, micro-batch processing with those same technologies plus Flink, and finally stream processing with Spark Streaming, Flink, and Kafka.
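The distinction between batch, micro-batch, and stream processing can be sketched in plain Python. This is a conceptual illustration, not Spark or Flink code: a micro-batch engine groups an unbounded event stream into small fixed-size chunks and applies ordinary batch logic to each chunk.

```python
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an (unbounded) event stream into fixed-size chunks,
    the way a micro-batch engine hands small batches to batch logic."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Batch logic applied per micro-batch: here, a simple sum.
totals = [sum(b) for b in micro_batches(range(10), batch_size=4)]
print(totals)  # [6, 22, 17]
```

Pure batch processing is the limiting case where the whole dataset is one chunk; true stream processing is the other extreme, where each event is handled as it arrives.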
Then we talked about scheduling beyond cron, using tools like Airflow. We covered cloud platforms and the technologies built on them: AWS includes S3, DMS, and Elastic MapReduce (EMR); Google Cloud Platform includes Dataproc; and Azure includes Azure Databricks and HDInsight. After that, we moved on to infrastructure tooling with Terraform, Ansible, Docker, and Kubernetes. Then we briefly covered emerging technologies like serverless functions and Delta Lake.
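What workflow schedulers like Airflow add over plain cron is dependency awareness: tasks are arranged in a directed acyclic graph and run only after their upstream tasks succeed. Here is a minimal sketch of that core idea using Python's standard-library `graphlib`; the task names are hypothetical, and this is not Airflow's API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract feeds both transform and validate,
# and load runs only after both of those succeed.
dependencies = {
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

def run_order(deps: dict) -> list:
    """Return one valid execution order respecting all dependencies --
    the guarantee a DAG scheduler provides that cron's timers cannot."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)  # "extract" always precedes "transform"/"validate", which precede "load"
```

Airflow layers retries, scheduling intervals, and monitoring on top of this dependency resolution, which is why it shows up on the roadmap once pipelines outgrow a crontab.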
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!