In Data Engineer’s Lunch #5: What is a Data Lake?, we discuss what data lakes are, why we need them, how we get data in and out, and different implementations of data lakes. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Data Engineer’s Lunch in person, it is hosted every Monday at 12 PM EST. Register here now!
Executive memory problem: Many people don’t realize that a data lake can simply be BigQuery these days. The terms “data lake” and “data warehouse” trigger a lot of PTSD in executives who have lived through bad data lake or warehouse projects and don’t realize that the cost and complexity have come down a lot.
Question from Will Angel
Garbage in Garbage Out: How do we avoid our data lakes turning into data swamps?
Answer from Nirmal
Stream data in via Kafka (requires some filtration)
Leverage a data catalog (metadata, schema, name)
Use separate data lakes for raw ingestion and for cleaner data (the latter is still not quite a warehouse)
Dataset identification / governance
Use Databricks bronze/silver/gold terminology
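To make the catalog and bronze/silver/gold ideas above a little more concrete, here is a minimal sketch in plain Python. It is not how Databricks or any particular catalog product works; the `CATALOG` dictionary, the `page_views` dataset, and the helper names are all hypothetical and exist only for illustration. The point is that raw (bronze) records are checked against a catalogued schema before being promoted to silver, and only silver data feeds the gold summary.

```python
from collections import defaultdict

# Hypothetical catalog entry: dataset name -> required fields and their types.
CATALOG = {
    "page_views": {"user_id": str, "url": str, "duration_ms": int},
}


def conforms(record, schema):
    """Return True when the record has every catalogued field with the right type."""
    return all(
        isinstance(record.get(field), expected)
        for field, expected in schema.items()
    )


def promote_to_silver(bronze_records, dataset):
    """Split raw (bronze) records into validated (silver) and rejected ones."""
    schema = CATALOG[dataset]
    silver, rejected = [], []
    for record in bronze_records:
        (silver if conforms(record, schema) else rejected).append(record)
    return silver, rejected


def build_gold(silver_records):
    """Aggregate validated records into a consumer-ready (gold) summary."""
    total_ms = defaultdict(int)
    for record in silver_records:
        total_ms[record["url"]] += record["duration_ms"]
    return dict(total_ms)


# Illustrative bronze data: two clean rows and two that should be rejected.
bronze = [
    {"user_id": "u1", "url": "/home", "duration_ms": 1200},
    {"user_id": "u2", "url": "/home", "duration_ms": 800},
    {"user_id": "u3", "url": "/pricing", "duration_ms": "fast"},  # wrong type
    {"url": "/pricing", "duration_ms": 300},  # missing user_id
]

silver, rejected = promote_to_silver(bronze, "page_views")
gold = build_gold(silver)
print(gold)
```

Keeping the rejected records instead of silently dropping them is a deliberate choice in this sketch: it leaves a trail for governance and makes it easier to see when a source starts sending bad data.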
How do we get data into and out of a data lake?
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Stream into it (Kafka, Spark streaming, Flink, Alpakka)
Batch into it (*, Spark, MapReduce, etc.)
Integration to query engines out of the box
Storage: S3/Azure Storage
Query: Snowflake Query Language
Storage: Google Storage
Azure Data Analytics
Storage: Azure Storage
Query: Azure Data Analytics
Amazon Redshift Spectrum
SparkSQL / Spark
Stream out of it (Spark streaming, Flink, Kafka, Alpakka)
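The difference between ETL and ELT in the list above is mostly about where the transform step sits relative to the load. The sketch below illustrates that ordering with plain Python and an in-memory dictionary standing in for the lake. The `extract` source, the row format, and the `events` / `events_raw` names are assumptions made up for this example, not part of any tool mentioned in the talk.

```python
def extract():
    """Stand-in for a source system; returns raw rows as strings."""
    return ["alice,42", "bob,not-a-number", "carol,7"]


def transform(rows):
    """Parse rows into dicts, dropping any that fail to parse."""
    parsed = []
    for row in rows:
        name, _, value = row.partition(",")
        try:
            parsed.append({"name": name, "value": int(value)})
        except ValueError:
            continue  # unparseable row is discarded
    return parsed


def etl(lake):
    """Extract, Transform, Load: only cleaned rows reach the lake."""
    lake["events"] = transform(extract())


def elt(lake):
    """Extract, Load, Transform: raw rows land first, cleaning happens afterwards."""
    lake["events_raw"] = extract()
    lake["events"] = transform(lake["events_raw"])


# Two separate in-memory "lakes" so the approaches can be compared side by side.
etl_lake, elt_lake = {}, {}
etl(etl_lake)
elt(elt_lake)

print(sorted(etl_lake))  # keys present after ETL
print(sorted(elt_lake))  # keys present after ELT
```

Both approaches end up with the same cleaned `events`, but only the ELT lake still holds the raw rows, so the transform can be changed and re-run later. The trade-off is that the lake now contains uncleaned data, which is where the catalog and governance points from the earlier answer come in.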
If you missed last week’s Data Engineer’s Lunch #4: Airflow for Data Engineering, be sure to check it out! As mentioned above, the live recording of Data Engineer’s Lunch #5 is embedded below. Also, check out our YouTube page for more videos and the Data Engineer’s Lunch playlist here! Don’t forget to subscribe while you are there!