Business Platform Team

Anant Corporation Blog: Our research, knowledge, thoughts, and recommendations about building and managing online business platforms.

Tag Archives: apache spark


Data Engineer’s Lunch #11: MLFlow and Spark

In Data Engineer’s Lunch #11: MLFlow and Spark, we discussed using MLflow, a machine learning lifecycle management tool, with Apache Spark. This post serves both as a continuation of our previous series on Apache Spark Companion Technologies and as a recap of one of our Data Engineer’s Lunch events. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
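As a rough illustration of what this pairing looks like in practice, here is a minimal sketch (not code from the lunch itself) of logging a Spark MLlib model with MLflow’s tracking API; the PySpark session, hyperparameter values, and the training_data.parquet path are illustrative placeholders.

```python
# Minimal sketch: tracking a Spark MLlib training run with MLflow.
import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mlflow-spark-demo").getOrCreate()

# Placeholder dataset with "features" and "label" columns already prepared.
training = spark.read.parquet("training_data.parquet")

with mlflow.start_run():
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(training)

    # Record the hyperparameters and the fitted Spark model with the run.
    mlflow.log_param("maxIter", 10)
    mlflow.log_param("regParam", 0.01)
    mlflow.spark.log_model(model, "spark-model")
```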

Continue reading

Apache Spark Companion Technologies: Distributed Machine Learning Frameworks

One of Apache Spark’s core features is Spark MLlib, a library for doing machine learning in Spark. Most data science education, however, is built around other machine learning libraries, such as scikit-learn. Retraining data scientists to use Spark MLlib adds an extra cost on top of the data engineering work that adopting Spark already requires. Databricks addresses this by offering distributed versions of some of these familiar machine learning frameworks as part of the Databricks platform.
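As one hedged illustration of the general idea (using the open-source joblib-spark backend rather than anything Databricks-specific), existing scikit-learn code can be distributed across a Spark cluster with only a few extra lines:

```python
# Minimal sketch: distributing scikit-learn cross-validation over Spark
# with the joblib-spark backend (pip install joblibspark).
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from joblib import parallel_backend
from joblibspark import register_spark

register_spark()  # make the "spark" joblib backend available

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# The cross-validation folds run as Spark tasks instead of local processes;
# the unchanged scikit-learn API is the point here.
with parallel_backend("spark", n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
```

The design point is that the modeling code stays in the framework data scientists already know, while Spark handles the fan-out.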

Continue reading

Apache Cassandra Lunch #30: Cassandra & Spark Foundations

In case you missed it, this blog post is a recap of Cassandra Lunch #30, covering the basics of using Spark and Cassandra together. We discuss the strengths of each technology, the advantages of combining them, and the potential drawbacks along with the configuration methods that help avoid them. The live recording of Cassandra Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
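For a sense of what using them together looks like in code, here is a minimal sketch (not taken from the lunch itself) of reading a Cassandra table into a Spark DataFrame with the DataStax Spark Cassandra Connector; the connector version, contact point, keyspace, and table names are placeholders.

```python
# Minimal sketch: reading a Cassandra table into a Spark DataFrame.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-spark-demo")
    # Placeholder connector version and contact point.
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="demo_keyspace", table="users")
    .load()
)

df.show()
```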

Continue reading

Counting in DataStax Enterprise

In this blog, we discuss the various methods of counting rows offered in DataStax Enterprise. Traditional databases maintain row counts as a matter of course, but Cassandra’s architecture makes that more difficult. That same architecture is what gives Cassandra its advantages: it is scalable, distributed, and offers fast reads and writes. Since we still need row counts, we can use DSE’s features to obtain them in various circumstances.
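As one hedged example of the kind of approach the post covers, a distributed Spark scan (available through DSE Analytics, which bundles the Spark Cassandra Connector) can produce a row count without putting the whole load on a single coordinator the way a plain SELECT COUNT(*) does; the keyspace and table names below are placeholders.

```python
# Minimal sketch: counting rows in a Cassandra table via a distributed
# Spark scan rather than a single-coordinator SELECT COUNT(*).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dse-count-demo")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

row_count = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="demo_keyspace", table="events")
    .load()
    .count()
)

print(f"Rows in demo_keyspace.events: {row_count}")
```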

Continue reading

Run an Apache Spark Job on DataStax Astra

In this blog, we discuss how to run an Apache Spark job against our DataStax Astra database. A webinar recording is also embedded below if you want to watch a live demo in which we use Gitpod, sbt, Scala, and spark-submit to run two Spark jobs against our DataStax Astra instance. If you missed part 1 of this series, Connect Apache Spark and DataStax Astra, it is linked below.
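The demo itself uses Scala, sbt, and spark-submit, but as a rough sketch of the connection settings involved, here is what pointing a Spark session at Astra through the Spark Cassandra Connector’s secure connect bundle support can look like; the bundle path, credentials, keyspace, and table are placeholders.

```python
# Minimal sketch: connecting Spark to a DataStax Astra database via the
# secure connect bundle (Spark Cassandra Connector 2.5+ cloud support).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("astra-spark-demo")
    # Ship the bundle to executors, then reference it by file name.
    .config("spark.files", "/path/to/secure-connect-bundle.zip")
    .config("spark.cassandra.connection.config.cloud.path",
            "secure-connect-bundle.zip")
    .config("spark.cassandra.auth.username", "astra_client_id")
    .config("spark.cassandra.auth.password", "astra_client_secret")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="demo_keyspace", table="demo_table")
    .load()
)

df.show()
```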

Continue reading