Anant Corporation Blog: Our research, knowledge, thoughts, and recommendations about building and managing online business platforms.
One of Apache Spark’s core components is Spark MLlib, a library for doing machine learning in Spark. Most data science education relies on other machine learning libraries, such as scikit-learn. Retraining data scientists to use Spark MLlib can be an extra cost on top of the data engineering work that is already needed just to use Spark. Databricks offers distributed versions of some of these machine learning frameworks as part of the Databricks platform.
Model deployment is the process of putting our trained models to work. It involves moving a model somewhere with the resources to do serious processing. That place also needs the ability to receive or retrieve the data to be processed. We place the trained model within an architecture that delivers data to the model for processing, then retrieves the results and delivers or stores them so that users can see or use them. Choices also need to be made about whether the model gets retrained, updated, or replaced during operation.
Model selection tests are used to determine which of two trained machine learning models performs better. The point of these tests is to predict which model will generalize better to unseen data, so comparing a single test result from each model is not enough. Today we will run through a number of different model selection tests and discuss how they work and how to interpret their results.
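As a rough illustration of why a single comparison is not enough, here is a minimal sketch of one common approach, a paired t-test over per-fold scores. The fold accuracies and the names `model_a_scores` and `model_b_scores` are hypothetical values made up for this example, not results from the post, and a paired t-test is only one of several possible model selection tests.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t-statistic for two equal-length lists of scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same folds")
    if len(scores_a) < 2:
        raise ValueError("at least two folds are needed")
    # Work with per-fold differences so fold difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    if std_diff == 0:
        raise ValueError("differences are constant; t-statistic is undefined")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Hypothetical accuracies from the same 5 cross-validation folds.
model_a_scores = [0.81, 0.79, 0.84, 0.80, 0.82]
model_b_scores = [0.78, 0.80, 0.79, 0.77, 0.80]

t_stat = paired_t_statistic(model_a_scores, model_b_scores)

# Two-sided critical value at the 5% level for 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4
print(f"t = {t_stat:.3f}, significant at 5%: {significant}")
```

With these made-up numbers, model A has the higher average accuracy, yet the difference does not clear the 5% threshold, so the averages alone would overstate the case for model A. Scores from cross-validation folds are not fully independent, so treat a test like this as a rough guide rather than a strict guarantee.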
Machine learning is increasingly becoming a part of people’s business platforms. To make full use of machine learning in our business platforms, we need a tool with characteristics similar to those of our database tools. It needs to be distributed and scalable, and it needs to integrate nearly seamlessly with our data store. Luckily, Spark is a great tool for this purpose. In this post and future ones, we will learn how to set up an environment for performing machine learning using Apache Spark and Cassandra, and learn more about machine learning in general.
The first part of any machine learning project is to gather data. This sounds easy. You may think that this puts you in the perfect position to work with data you already have in relational databases. In some circumstances that may be correct. However, most of the ways that we store data in databases for business platforms are suboptimal for machine learning. They require more work to gain the insights we want out of our data.