Successful data science projects typically need a process for building, improving, and deploying machine learning models on a continual basis. Organisations often build effective models, but then struggle to get them into production. The problem is so acute that some organisations have hired engineers solely to take machine learning from proof of concept to reality. In this blog post, we'll talk about the technologies we’ve been implementing at Cleo to make our machine learning deployments simple yet robust.
For some time, we’ve been using Amazon SageMaker to train models in the cloud and deploy them as micro-services. For those unfamiliar with it, SageMaker is Amazon’s machine learning training and deployment service. By using Docker and separating the data from the code, SageMaker trains models efficiently: the training instances run only for as long as the training job needs them. Once the model is built, it allows single-click deployment to an endpoint, which our backend services can then query whenever they need an inference. You can read more about Cleo’s use of SageMaker in this guest blog for AWS.
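To make that concrete, here’s a rough sketch of what the train-and-deploy workflow looks like with the SageMaker Python SDK (v2 parameter names). The image URI, role, bucket paths, and instance types below are placeholders, not our actual configuration.

```python
# Rough sketch of the SageMaker workflow described above; all names are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<your-training-image>",        # Docker image containing the training code
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/models",  # where the trained model artefact is written
    sagemaker_session=session,
)

# fit() spins up a training instance, runs the container against the data in S3,
# saves the model artefact, and shuts the instance down again.
estimator.fit({"train": "s3://<your-bucket>/training-data"})

# deploy() stands up a persistent HTTPS endpoint that backend services can query.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
)
```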
SageMaker tracks data versions across the later steps of training and deployment, but it doesn’t capture details of the data extraction or preprocessing, and it doesn’t have an easy way to relate each endpoint back to the source code that went into it. Until recently we were operating SageMaker through an untidy and poorly versioned collection of bash and Python scripts, mixed in with some old-fashioned clicking through the AWS console.
Enter Netflix’s Metaflow. Open-sourced in December 2019, Metaflow is a Python package that can be used to wrap both the training and deployment workflows of an organisation’s machine learning models. It arrived just as we were looking to harden our deployment and version control and increase our iteration speed, so we became early adopters.
We were able to get up and running early in the new year with our first model. We moved the deployment code into a single script — a Metaflow ‘flow’ — with Git and Docker commands handled through their Python SDKs.
With a single command we can now do all of the following (a stripped-down sketch of the flow follows the list):
- extract a new dataset
- perform preprocessing
- train a number of new models to get the best hyperparameters
- evaluate the model
- deploy the model as an endpoint
- test that endpoint’s performance
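To give a flavour of what such a flow looks like, here’s a stripped-down sketch. The step bodies are placeholders standing in for our real extraction, training, and deployment code, not anything Cleo actually ships:

```python
from metaflow import FlowSpec, step


class DeployModelFlow(FlowSpec):
    """Stripped-down sketch: each step body is a placeholder."""

    @step
    def start(self):
        # Extract a fresh dataset (placeholder rows standing in for a real query).
        self.dataset = [{"feature": 1.0, "label": 0}, {"feature": 2.0, "label": 1}]
        self.next(self.preprocess)

    @step
    def preprocess(self):
        # Placeholder preprocessing.
        self.features = [row["feature"] for row in self.dataset]
        self.labels = [row["label"] for row in self.dataset]
        self.next(self.train)

    @step
    def train(self):
        # In the real flow this trains several candidates and keeps the best
        # hyperparameters; here it just records a stand-in "model".
        self.model = {"weights": sum(self.features) / len(self.features)}
        self.next(self.evaluate)

    @step
    def evaluate(self):
        self.metrics = {"accuracy": 0.0}  # placeholder metric
        self.next(self.deploy)

    @step
    def deploy(self):
        # In the real flow this creates or updates the SageMaker endpoint.
        self.endpoint_name = "my-model-endpoint"  # placeholder
        self.next(self.test_endpoint)

    @step
    def test_endpoint(self):
        # In the real flow this sends a smoke-test request to the new endpoint.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DeployModelFlow()
```

Running `python deploy_flow.py run` executes the steps in order, and every value assigned to `self` along the way is snapshotted as an artefact of that run.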
Each time the flow runs, it increments the version number and logs all of the parameters, hyperparameters, and datasets of the run. It also tags our GitHub and Docker repositories and all our SageMaker cloud objects with the version number, so we can join the dots between them later.
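As a rough illustration of that cross-system tagging (the version string, image name, and endpoint ARN below are invented), the equivalent calls with the Git, Docker, and AWS Python SDKs look something like this:

```python
# Sketch of the tagging step; version, image name, and ARN are placeholders.
import boto3
import docker
import git

version = "v1.4.0"  # placeholder: in the flow this comes from the run itself

# Tag the Git commit that produced this model and push the tag.
repo = git.Repo(".")
repo.create_tag(version, message=f"Model release {version}")
repo.remote("origin").push(version)

# Tag the Docker image used for training/serving with the same version.
client = docker.from_env()
image = client.images.get("cleo/model-service:latest")  # placeholder image name
image.tag("cleo/model-service", tag=version)

# Tag the SageMaker endpoint so it can be traced back to the code.
sm = boto3.client("sagemaker")
sm.add_tags(
    ResourceArn="arn:aws:sagemaker:eu-west-1:123456789012:endpoint/my-endpoint",  # placeholder
    Tags=[{"Key": "release", "Value": version}],
)
```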
Metaflow also comes with inbuilt AWS integration: instead of local storage, objects can be persisted to Amazon’s data store, S3. It also ships with a fast S3 client that uploads logs and data in the background. That’s a big help for our collaboration: it makes it easy to pick up where a colleague has left off, and allows multiple people to work on the same application without getting in each other’s way.
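That persistence is what makes the hand-off work: anything a step assigns to `self` goes into the datastore, and a colleague can pull it straight back down with the Metaflow client API. A minimal sketch, reusing the illustrative flow name from the earlier example:

```python
# Minimal sketch of picking up a teammate's results via the Metaflow client API.
# 'DeployModelFlow' is the illustrative flow defined earlier, not a real Cleo flow.
from metaflow import Flow

run = Flow("DeployModelFlow").latest_successful_run
print(f"Resuming from run {run.id}")

# Artefacts assigned to self.* in the steps were persisted (to S3 when the
# datastore is configured for it), so they can be loaded on any machine.
model = run.data.model
metrics = run.data.metrics
```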
Building and deploying a new model with a single command means we can focus on data science rather than engineering, and makes switching between models and testing hypotheses speedy and fun.