In Data Engineer’s Lunch #5: What is a Data Lake?, we discuss what data lakes are, why we need them, how we get data in and out, and different implementations of data lakes. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Data Engineer’s Lunch in person, it is hosted every Monday at 12 PM EST. Register here now!
Executive memory problem: Many people don’t realize that a data lake can simply be BigQuery these days. The terms “data lake” and “data warehouse” trigger a lot of PTSD in executives who have lived through bad data lake or warehouse projects and don’t realize how much the cost and complexity have come down.
Question from Will Angel
Garbage in, garbage out: how do we keep our data lakes from turning into data swamps?
Answer from Nirmal
Stream data in via Kafka (requires some filtration)
Leverage a data catalog (metadata, schema, name)
Other ideas
Different data lakes for ingestion, cleaner data, not quite a warehouse
Dataset identification / governance
Use Databricks bronze/silver/gold terminology
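To make the catalog and bronze/silver ideas above concrete, here is a minimal sketch in plain Python. It assumes a toy in-memory lake; `CatalogEntry`, `Lake`, `conforms`, and the `clicks` dataset are hypothetical names invented for illustration, not part of Databricks or any catalog product. Raw records land in bronze only if the dataset is registered in the catalog, and only schema-conforming records are promoted to silver.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    # Hypothetical catalog entry: dataset name, owner, and expected schema.
    name: str
    owner: str
    schema: dict  # column name -> expected Python type


def conforms(record, schema):
    # A record conforms if it has exactly the expected columns,
    # each holding a value of the expected type.
    return record.keys() == schema.keys() and all(
        isinstance(record[col], typ) for col, typ in schema.items()
    )


@dataclass
class Lake:
    catalog: dict = field(default_factory=dict)  # dataset name -> CatalogEntry
    bronze: dict = field(default_factory=dict)   # raw records, as received
    silver: dict = field(default_factory=dict)   # records that passed validation

    def register(self, entry):
        self.catalog[entry.name] = entry

    def ingest(self, dataset, record):
        # Refuse data for datasets nobody has described in the catalog.
        if dataset not in self.catalog:
            raise KeyError(f"dataset {dataset!r} is not in the catalog")
        self.bronze.setdefault(dataset, []).append(record)

    def promote(self, dataset):
        # Copy only schema-conforming bronze records into silver.
        schema = self.catalog[dataset].schema
        clean = [r for r in self.bronze.get(dataset, []) if conforms(r, schema)]
        self.silver[dataset] = clean
        return len(clean)


lake = Lake()
lake.register(CatalogEntry("clicks", "web-team", {"user_id": int, "url": str}))
lake.ingest("clicks", {"user_id": 1, "url": "/home"})
lake.ingest("clicks", {"user_id": "oops", "url": "/home"})  # wrong type
lake.ingest("clicks", {"user_id": 2})                       # missing column
promoted = lake.promote("clicks")
print(promoted)  # 1
```

Bronze keeps all three records, so nothing is lost, while silver holds only the one that matches the cataloged schema. A gold layer would be a further, business-level aggregation on top of silver.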
How do we get data into and out of a data lake?
Ingress
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Stream into it (Kafka, Spark streaming, Flink, Alpakka)
Batch into it (*, Spark, MapReduce, etc.)
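The difference between ETL and ELT is mostly about where the transform step happens. The sketch below is only an illustration under simple assumptions: the "lake" is a plain dictionary, and `extract`, `transform`, `etl`, and `elt` are hypothetical helpers rather than any tool's API.

```python
def extract():
    # Hypothetical source rows; amounts arrive as strings.
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "5.00"},
    ]


def transform(rows):
    # Cast the amount column from string to float.
    return [{**row, "amount": float(row["amount"])} for row in rows]


def etl(lake):
    # ETL: transform before loading, so only the cleaned form is stored.
    lake["orders"] = transform(extract())


def elt(lake):
    # ELT: load the raw rows first, then transform inside the lake.
    # The raw copy stays available for reprocessing later.
    lake["orders_raw"] = extract()
    lake["orders"] = transform(lake["orders_raw"])


etl_lake, elt_lake = {}, {}
etl(etl_lake)
elt(elt_lake)
print(sorted(etl_lake))  # ['orders']
print(sorted(elt_lake))  # ['orders', 'orders_raw']
```

Both paths end with the same cleaned `orders` data. The ELT path also keeps the untouched raw rows, which is the usual reason it fits data lakes well.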
Egress
Integration to query engines out of the box
Cloud
Snowflake
Storage: S3/Azure Storage
Query: Snowflake SQL
Google BigQuery
Storage: Google Cloud Storage
Query: BigQuery
Azure Data Lake Analytics
Storage: Azure Storage
Query: Azure Data Lake Analytics
Amazon Redshift Spectrum
Storage: S3
Query: SQL
Amazon Athena
AWS Glue
Open Source
Presto
Hive
SparkSQL / Spark
Stream out of it (Spark streaming, Flink, Kafka, Alpakka)
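As a rough illustration of the query-engine style of egress, the sketch below writes a few newline-delimited JSON files to a temporary directory (standing in for lake storage) and answers a SQL question over them. SQLite is used only as a stand-in so the example runs with the Python standard library. It is an assumption made for illustration, not how Presto, Athena, or BigQuery work internally, since those engines typically read the files in place rather than copying rows into a database. The file names and the `events` table are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def write_sample_lake(root):
    # Two hypothetical "objects" of newline-delimited JSON.
    (root / "events-001.jsonl").write_text(
        '{"user": "ann", "bytes": 120}\n{"user": "bob", "bytes": 80}\n'
    )
    (root / "events-002.jsonl").write_text('{"user": "ann", "bytes": 30}\n')


def query_lake(root, sql):
    # Apply a schema at read time: load every file into an in-memory
    # table, then let the SQL engine answer the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
    for path in sorted(root.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            row = json.loads(line)
            conn.execute(
                "INSERT INTO events VALUES (?, ?)", (row["user"], row["bytes"])
            )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_sample_lake(root)
    totals = query_lake(
        root, "SELECT user, SUM(bytes) FROM events GROUP BY user ORDER BY user"
    )

print(totals)  # [('ann', 150), ('bob', 80)]
```

The point is that the files stay in a simple open format, and the schema and SQL are applied only when someone asks a question.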
If you missed last week’s Data Engineer’s Lunch #4: Airflow for Data Engineering, be sure to check it out! As mentioned above, the live recording of Data Engineer’s Lunch #5 is embedded below. Also, check out our YouTube page for more videos and the Data Engineer’s Lunch playlist here! Don’t forget to subscribe while you are there!