SageMaker and Metaflow: Modern Machine Learning Deployment at Cleo
Everything you need to know, by the Cleo AI Data Science Team 💡
This is some text inside of a div block with Cleo CTA
CTASigning up takes 2 minutes. Scan this QR code to send the app to your phone.
Last year, our Heroku Postgres database was fast approaching its 3TB storage limit, the maximum size that Heroku supports at time of writing. To make sure our community of 4 million people could continue to work on their financial health with Cleo, it was essential to migrate away from Heroku as quickly and smoothly as possible.
After assessing a number of possible solutions, we landed on Amazon’s Aurora Postgres service. We were really impressed by Aurora’s ability to scale out to 15 read replicas supporting 64TB per database instance, more than enough to sustain our ambitious scaling goals.
In close second place was Citus Data. Though this have allowed us to shard our database across a number of nodes and move to a distributed database setup, we decided to not attempt both a major provider switch and an architecture change at the same time, given the complexity of migrating a database.
To aid with the decision making process we also invested time into benchmarking. We had two main concerns:
Through pg_bench, we assessed both of these concerns by running a number of tests involving some of our gnarliest queries against production data. These tests were run by executing queries against a production ready Aurora instance from inside a Heroku Dyno. This gave us realistic latency and performance metrics and the confidence that Aurora was the best option .
At first glance, you might think it's fairly simple to migrate one database to another, particularly when they are both running the same version of postgres and are technically already in the same data centre (AWS). Well, our hopes were quickly crushed when we realised that Heroku doesn’t grant superuser access to their postgres databases, meaning we couldn’t use logical replication and would need to find another creative solution to stream our data into another database.
Our only viable option to get our data out of Heroku was to restore a database somewhere from a base backup, and continue to stream WAL changes. Heroku massively helped us here and gave us access to the required backups and logs in S3. However, we encountered another problem: Aurora does not support this. At all. The solution? Introduce an intermediary database between Heroku and Aurora. We thought on our feet and quickly spun up our own production ready postgres instance in EC2, along with a read replica, not something we had planned to do.
Our migration plan was quickly becoming more and more complicated as it now involved migrating to EC2 and then performing a second migration onto Aurora. We were however able to automate the migration to EC2 through a series of ansible scripts which would take care of provisioning the database on EC2, downloading the base backup and continuing with WAL streaming (using wal-g). The script also handled the provisioning of a read replica and automating the promotion of the database to be our primary and switch off of Heroku. The second step (EC2 to Aurora) required the use of Amazon’s DMS service. This definitely had its quirks and was not the easiest process to automate, given it is primarily driven through the AWS console, but we’ll save that for another blog.
By the time ‘migration day’ rolled around, we had our plan set out and had performed all the testing we could think of. Just when we thought nothing else could conceivably go wrong, our error tracking software, Rollbar, experienced a major outage and we were forced to delay our migration until everything on their end was resolved.
After 2 days of refreshing Rollbar’s status page, and being confident that they were back up and running, The Great Migration’ could commence. A simplified version of our process was roughly as follows:
Of course everything went smoothly... We had one slight hiccup along the way, and we nearly fell down at the final hurdle when bringing Cleo back up once we had fully migrated to Aurora. We started accepting new writes and quickly noticed a high application error rate (always wait for your error tracking service to be online). The errors themselves were related to the fact that DMS hadn’t created the primary key sequences on each table, meaning new records were using primary keys which were already taken. Luckily for us this was a simple fix, we took Cleo back offline and created all the sequences manually.
Since The Great Migration last year, we’ve been operating smoothly with our database hosted over in AWS. We have made one change: due to the running cost of Aurora, we eventually switched to using Amazon’s Relational Database Service (RDS) , essentially Aurora without all the bells and whistles.
My advice to anyone taking on a similar migration project would be to not be afraid to delay your migration if certain aspects are not correct on the planned day. If we had gone ahead when our error tracking software was down, we wouldn't have been able to react fast enough to the increased error rates and the migration would not have been a success! It was a tough call to make, but it was definitely worth the delay to ensure everyone went smoothly.
One final note, we want to give a massive shoutout and thank you to Federico Campoli whose extensive knowledge of Postgres helped us complete the migration successfully, as well as Jesse Soyland who guided us through the motions of getting our data off Heroku.
Like the sound of what we're up to at Cleo? We're hiring! Check out our open roles.
Everything you need to know, by the Cleo AI Data Science Team 💡
Getting Cleo’s testing game tight.
Josh, Tech Lead, dives into what he looks for when reviewing pull requests at Cleo. If you're looking to write code that scales better, this is a must read.