Model deployment is the process of putting our trained models to work. It involves moving the model somewhere with the resources to do serious processing and the ability to receive or retrieve the data to be processed. We place the trained model within an architecture that delivers data to it, then retrieves the results and delivers or stores them so that users can see or use them. We also need to decide whether the model gets retrained, updated, or replaced during operation.
In the graphic we used at the beginning of this series, model deployment is the section labeled prediction that comes at the end of the learning process. In this phase, new data comes in and our model produces labels. While this graphic implies that once we move into this phase, no more training or testing is done, it is often necessary to keep an eye on the performance of our model over time. If the distribution of data coming in shifts and our model no longer predicts well, then we would need to replace it.
Our first step when moving a model to production is to move the model itself, which is ultimately just data, into a production environment. The structure of that environment can vary wildly depending on the architecture around it and the predicted uses for the model. In this case, our models are Python objects in pyspark. Some machine learning modules like scikit-learn and PyTorch come with utilities to save and load models, or with built-in compatibility layers for other tools that can save and load models.
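The most basic option for saving models is to use pickle. Pickle can serialize most Python objects, making it a good option for transferring models between Python contexts. As long as both ends are Python environments with compatible library versions, models can be transferred this way. One caveat: pyspark models that wrap JVM objects generally cannot be pickled directly, so for those the save and load methods that pyspark itself provides are the safer route.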
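A minimal sketch of the pickle round trip, using only the standard library. The "model" here is a hypothetical dict of learned parameters with made-up values, not a real pyspark model, and `predict` is an illustrative helper.

```python
import pickle

# Hypothetical learned parameters of a tiny linear classifier (illustrative values only).
model_state = {"weights": [0.8, -0.4], "bias": 0.1, "version": 3}


def predict(state, features):
    """Return 1 if the weighted sum of the features plus the bias is positive, else 0."""
    score = sum(w * x for w, x in zip(state["weights"], features)) + state["bias"]
    return 1 if score > 0 else 0


# Serialize to bytes; these could be written to a file or sent to another machine.
payload = pickle.dumps(model_state)

# On the receiving side, rebuild the object from the bytes.
restored_state = pickle.loads(payload)

print(predict(restored_state, [1.0, 0.5]))
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.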
ONNX is a format meant for deep learning models, but it also supports storing predictive models across different libraries and even languages. ONNX supports some pyspark models, but not all of them. Specifically, vectorizers, encoders, scalers, and some MLlib models are supported. The Python module onnxmltools has some options for working with pyspark models.
PMML is another format. It’s specialized for predictive models but only supports certain types. It supports sklearn, but only a small number of pyspark models are supported. PMML has been around a long time and thus is useful for interfacing with several other applications.
One potential method that makes unique use of our environment is to load our model into a Cassandra table. Both pickle and ONNX can produce byte strings, which can be stored in a Cassandra table alongside metadata, making it easy to keep track of different iterations of a model. We can even store models used within the training process this way.
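One way this could look is sketched below. The table name `model_store`, its columns, and the `save_model` helper are all assumptions made for illustration. The code only assumes a session object with an `execute(query, parameters)` method, which is the shape of the session in the DataStax Python driver; a small recording stand-in is used here so the sketch runs without a cluster.

```python
import pickle
from datetime import datetime, timezone

# Hypothetical table layout; the table and column names are assumptions.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS model_store (
    model_name text,
    version int,
    created_at timestamp,
    format text,
    payload blob,
    PRIMARY KEY (model_name, version)
) WITH CLUSTERING ORDER BY (version DESC)
"""

INSERT_CQL = (
    "INSERT INTO model_store (model_name, version, created_at, format, payload) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def save_model(session, model_name, version, model_obj):
    """Pickle model_obj and write it, with its metadata, through session.execute."""
    payload = pickle.dumps(model_obj)
    created_at = datetime.now(timezone.utc)
    session.execute(INSERT_CQL, (model_name, version, created_at, "pickle", payload))
    return len(payload)


class RecordingSession:
    """Stand-in for a driver session so the sketch runs without a cluster."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))


session = RecordingSession()
session.execute(CREATE_TABLE_CQL)
size = save_model(session, "churn_classifier", 1, {"weights": [0.8, -0.4], "bias": 0.1})
print(f"stored {size} bytes; {len(session.calls)} statements issued")
```

Keying the table on model name and version keeps every iteration of a model retrievable, and the descending clustering order puts the newest version first.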
There are even more potential service architectures than there are ways of storing models. Theoretically, any existing architecture for delivering data to users can be modified to deliver the results generated by a predictive model. The basic idea, however, is simple: leave the processing to a machine that can handle it, and don’t tie up your entire architecture serving a single request. You should have a queue of jobs and a processing method that leaves both the ingestion system and the results delivery system open even while processing is being done.
Many architectures can be used, but the first we will cover is distributed web services. The processing is split up such that different services handle requests, processing, potentially data management, and furnishing results. The request handling service receives a request and at least some of the data, and puts a job in the queue. The job at the front of the queue is pulled and passed to a worker. If more data is needed, that data is retrieved from elsewhere, and the worker then processes the job using the model. The results are loaded into a database that is accessible to the user.
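The request, queue, worker, and results flow can be sketched in a single process with the standard library. This is only an illustration under simplifying assumptions: `toy_model`, `handle_request`, and the in-memory `results_db` dict are hypothetical stand-ins for separate services and a real database.

```python
import queue
import threading


def toy_model(features):
    """Hypothetical model: predicts 1 when the feature sum is positive."""
    return 1 if sum(features) > 0 else 0


job_queue = queue.Queue()  # filled by the request-handling service
results_db = {}            # stand-in for the results database, keyed by job id


def handle_request(job_id, features):
    """Request handler: enqueue the job and return immediately."""
    job_queue.put((job_id, features))


def worker():
    """Worker: pull the front job, run the model, store the result."""
    while True:
        job = job_queue.get()
        if job is None:  # sentinel value tells the worker to stop
            job_queue.task_done()
            break
        job_id, features = job
        results_db[job_id] = toy_model(features)
        job_queue.task_done()


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

handle_request("job-1", [0.5, 0.2])
handle_request("job-2", [-1.0, 0.3])

job_queue.join()     # wait until every queued job has been processed
job_queue.put(None)  # then shut the worker down
worker_thread.join()

print(results_db)
```

The request handler never waits on the model, so ingestion stays responsive while the worker is busy.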
Some databases like Postgres can use advanced analytics tools to perform ML processing, either through the integration of Python or R code or by importing a model directly from one of the formats above. PMML works well for this sort of thing if your model supports it.
It is also possible to use streaming data to furnish a pub/sub model, where the data coming into our ML model is a stream and users subscribe to the model's output stream.
Another choice that has to be made is whether our application does batch prediction or real-time prediction. In batch prediction, the system stores feature sets until some trigger fires, at which point it calculates predictions for all of the stored data and pushes the results to some other storage mechanism that interested parties can pull from. In real-time prediction, when a request comes in, we retrieve any missing required data and the model makes a prediction immediately. When making this choice, consider whether real-time prediction capabilities are necessary. Real-time prediction can be more complex and costly than batch prediction.
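The difference between the two modes can be sketched as follows. `BatchPredictor`, `predict_now`, and `toy_model` are hypothetical names, and the trigger used here (a fixed batch size) is just one assumption; a schedule or an external event would work the same way.

```python
def toy_model(features):
    """Hypothetical model: predicts 1 when the feature sum is positive."""
    return 1 if sum(features) > 0 else 0


class BatchPredictor:
    """Stores feature sets until a trigger fires, then predicts for all of them."""

    def __init__(self, model, batch_size):
        self.model = model
        self.batch_size = batch_size
        self.pending = []  # feature sets waiting for the next batch run
        self.results = {}  # stand-in for storage that consumers pull from

    def submit(self, record_id, features):
        self.pending.append((record_id, features))
        if len(self.pending) >= self.batch_size:  # the trigger
            self.flush()

    def flush(self):
        for record_id, features in self.pending:
            self.results[record_id] = self.model(features)
        self.pending.clear()


def predict_now(model, features):
    """Real-time path: the prediction is made as soon as the request arrives."""
    return model(features)


batch = BatchPredictor(toy_model, batch_size=3)
batch.submit("a", [1, 2])
batch.submit("b", [-3, 1])
print(batch.results)  # still empty: the trigger has not fired yet
batch.submit("c", [0, 0])
print(batch.results)  # all three predictions appear together

print(predict_now(toy_model, [1, 2]))
```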
When doing real-time prediction, the system load is determined by user requests rather than by the developer, so you may need machines with more resources to keep up. Real-time prediction also requires you to keep a closer eye on the entire infrastructure being used. A fault anywhere can leave you completely unable to furnish predictions, whereas batch systems generally require less careful watch. Real-time processing can also make it more difficult to keep track of the performance of your model.
Speaking of model performance, you need to have some idea of what to do if performance starts to decline. One option is to redo the entire training process with new data that reflects the current state of the predictions being made, then replace the old model in the service architecture with the new one by going through the same deployment process. This requires a lot of work and potentially abandons insights gained during the original training process. Some machine learning models have real-time training capabilities, where more data can be fed into the model at any time to continue training. It is also possible to have an automatic process that trains new models over time, compares their performance to that of the currently used model, and replaces old models as new ones prove to perform better.
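The automatic replacement idea can be sketched as a comparison on recent labeled data. Everything below is an assumption for illustration: the threshold models, the four-example holdout set, and the `choose_model` helper with its `min_gain` parameter. A real system would use a much larger evaluation set and a metric suited to the problem.

```python
def make_threshold_model(cutoff):
    """Hypothetical model factory: predicts 1 when the first feature exceeds cutoff."""
    def model(features):
        return 1 if features[0] > cutoff else 0
    return model


def accuracy(model, examples):
    """Fraction of (features, label) pairs that the model labels correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def choose_model(current, candidate, holdout, min_gain=0.0):
    """Return candidate only if it beats current on holdout by more than min_gain."""
    if accuracy(candidate, holdout) - accuracy(current, holdout) > min_gain:
        return candidate
    return current


current_model = make_threshold_model(0.5)    # the model in production
candidate_model = make_threshold_model(0.3)  # retrained on newer data

# Recent labeled data (made up) in which the incoming distribution has shifted.
recent_holdout = [([0.4], 1), ([0.6], 1), ([0.2], 0), ([0.35], 1)]

serving_model = choose_model(current_model, candidate_model, recent_holdout)
print(accuracy(current_model, recent_holdout), accuracy(candidate_model, recent_holdout))
```

Setting `min_gain` above zero avoids swapping models over differences that are small enough to be noise.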
In conclusion, model deployment is the process of getting a trained machine learning model to do the work it was trained for. We must make many decisions about that process. Any of those choices can affect how well the model and its surrounding architecture stand up to the rigors of real-world use. The most important thing to keep in mind when deploying a machine learning model is your use requirements.