Add Some Sparkle to Feature Engineering with Spritz
An introduction to Spritz, Cleo's feature engineering framework.
In a previous blog post, we explained how Espresso ☕, Cleo’s MLOps Framework, became the solution at Cleo to serve, deploy, and monitor machine learning services.
Espresso covers many aspects of our services' lifecycle, from development to deployment and monitoring. Its monitoring capabilities allowed us to track down bottlenecks in our code, and fixing them brought up to a 50x speedup 🏎️ in some of our services.
🔍 We recently introduced Espresso model monitoring, stay tuned for new blog posts about it.
This was great, but we soon realized there were bottlenecks we could not optimize, and they were all of the same nature: computing features.
Ingredients of our predictions 🍱
But let’s take a step back. What are features?
At this point, you should know how much we like culinary analogies: imagine the predictions, the outputs of our machine learning models, as delicious pizzas.
Even with a great recipe, you can bet you need top tier ingredients to ensure a great result: these are the features, inputs for machine learning models used to bake your predictions.
Let’s say you want to build a machine learning model capable of classifying fruits. We can tell from our human perspective there are some attributes of fruits that can help the model in this task.
If you don’t agree, think how many blue lemons 🍋 you've seen in your life. Probably none, that’s why color is a good feature for your model.
But also, bananas 🍌 are yellow. That’s why you might need another feature, like acidity. And so on.
A feature is an individual characteristic of what you are observing, like color for fruits, max temperature for a day, or balance for a bank account.
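As a toy illustration of the fruit example (the encoding and names here are our own, purely for illustration), a feature is just a structured value extracted from a raw observation:

```python
def extract_features(fruit: dict) -> dict:
    """Map a raw fruit observation to model-ready features.

    Toy example: color and acidity are the two features
    discussed above; the integer encoding is illustrative.
    """
    color_codes = {"yellow": 0, "green": 1, "red": 2, "orange": 3}
    return {
        "color": color_codes.get(fruit["color"], -1),  # categorical -> int
        "acidity": fruit["ph"],                         # lower pH = more acidic
    }

lemon = {"color": "yellow", "ph": 2.2}
banana = {"color": "yellow", "ph": 5.0}

print(extract_features(lemon))   # {'color': 0, 'acidity': 2.2}
print(extract_features(banana))  # {'color': 0, 'acidity': 5.0}
```

Note how color alone can't tell a lemon from a banana: both encode to the same value, which is exactly why the second feature earns its place.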
The problem ⚠️
Nice, we now know what features are, but what does it mean to compute them, and why is it such a big pain?
Features, like all the best ingredients in your kitchen, need to be prepped before cooking the meal. This activity is called feature engineering, and since features are computed out of data, it involves connecting to databases to fetch the raw data you need.
Let's go back to the max temperature example: to compute this feature you need all the temperature measures for that day, which we can imagine stored in a database, and you need to aggregate them to get the maximum. If you have a lot of measures, this can already be a problem in terms of latency, and this is a really simple feature.
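In code, even this simple feature implies fetching every reading for the day before reducing it (a sketch, with a plain list standing in for the database query):

```python
from datetime import datetime

# Stand-in for rows fetched from a measurements table:
# (timestamp, temperature in °C)
readings = [
    (datetime(2023, 6, 1, 9, 0), 18.5),
    (datetime(2023, 6, 1, 13, 0), 24.1),
    (datetime(2023, 6, 1, 16, 0), 22.7),
]

def max_temperature(rows):
    """Aggregate raw readings into the max-temperature feature.

    With millions of rows, fetching and reducing them at
    prediction time is where the latency problem starts.
    """
    return max(temp for _, temp in rows)

print(max_temperature(readings))  # 24.1
```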
Can't you feel the need for low latency? Sure, during the training phase you can spend days or weeks in your kitchen dicing onions for the perfect brunoise. But the serving phase is like a restaurant, and latency is king: you can't really make your customers wait while you run to the supermarket for onions you still have to dice before preparing the fine meal they ordered.
You need to prepare your ingredients in advance. Sometimes this works, and a lot of your features can be computed ahead of time with batch feature engineering. But have you ever eaten a cold burger, or month-old vegetables? Each ingredient has its own definition of freshness, and so do features.
Imagine you want to detect fraud: you need super fresh features to be able to classify users as fraudsters. If you also want to keep the latency of your prediction under control, the only solution is to ask someone else to prepare those ingredients for you continuously, whether or not you are going to use them right away. In a kitchen that might mean wasting food, but in data engineering we call it streaming feature engineering.
Our solution: Spritz 🍹
At Cleo there are plenty of critical use cases where it's just too hard to find a tradeoff between prediction latency and feature freshness (which impacts prediction accuracy): sometimes we need both, and that's why we built Spritz 🍹.
Spritz is our feature engineering framework, supporting both batch and streaming use cases. The idea behind it is very simple: we want to define feature transformations and let Spritz run them to compute our features for us, supporting freshness in the order of days or seconds.
Streaming use cases are implemented with Spark jobs responding to events stored in Kafka, while we rely on Airflow for batch use cases. In both cases, the features finally end up in our central feature store.
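To picture the streaming path, here is a plain-Python simulation of the event-driven update (the real pipeline runs this kind of logic inside a Spark job consuming from Kafka; the names below are invented for illustration):

```python
from collections import defaultdict

# Latest feature values per user, standing in for the online store.
online_store = defaultdict(
    lambda: {"txn_count_today": 0, "total_spent_today": 0.0}
)

def handle_event(event: dict) -> None:
    """Update features as soon as a bank-transaction event arrives.

    In the real pipeline a Spark streaming job would consume these
    events from a Kafka topic; here a function call stands in for it.
    """
    features = online_store[event["user_id"]]
    features["txn_count_today"] += 1
    features["total_spent_today"] += event["amount"]

# Events as they might land on a Kafka topic.
for evt in [
    {"user_id": 42, "amount": 10.0},
    {"user_id": 42, "amount": 120.0},
]:
    handle_event(evt)

print(online_store[42])  # {'txn_count_today': 2, 'total_spent_today': 130.0}
```

The key property is that the feature is already up to date the instant a prediction is requested, instead of being computed on the critical path.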
The feature store is like a pantry for our ingredients: a centralized solution to store features and their metadata in a way that is easily accessible for both training and serving use cases.
At Cleo, we think ease of use is a fundamental pillar of all platform solutions: that's why we decided to adopt the Feast approach, relying on solid, battle-tested data solutions for storing our features, while exposing access to the feature store through a single programmatic interface (extending Feast to add more functionality, such as fallback). To cover both training and serving scenarios, the feature store is actually composed of two different solutions:
- an offline store, which allows time travelling: retrieving feature values as of a specific timestamp. Access is optimized for analytics and training use cases.
- an online store, which allows low-latency access to the latest version of a feature, with the possibility to fall back to other computation mechanisms when the freshness objective is violated.
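The fallback behaviour mentioned above could be sketched roughly like this (a hypothetical illustration; the class, names, and thresholds are ours, not Spritz's actual API):

```python
import time

class OnlineStore:
    """Toy online store: serve the latest value if it is fresh enough,
    otherwise fall back to computing the feature on demand."""

    def __init__(self, freshness_seconds: float):
        self.freshness_seconds = freshness_seconds
        self._data = {}  # key -> (value, written_at)

    def write(self, key, value):
        self._data[key] = (value, time.time())

    def read(self, key, fallback):
        entry = self._data.get(key)
        if entry is not None:
            value, written_at = entry
            if time.time() - written_at <= self.freshness_seconds:
                return value  # fresh: the low-latency path
        # Stale or missing: pay the latency cost once and recompute.
        value = fallback()
        self.write(key, value)
        return value

store = OnlineStore(freshness_seconds=60)
store.write("user:42:balance", 130.5)
print(store.read("user:42:balance", fallback=lambda: 0.0))  # 130.5
```

The fallback keeps predictions correct when the freshness objective is violated, at the cost of a slower response for that one request.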
The workflow with Spritz is pretty easy:
1. Developers define features and transformations in a git repository, using plain Python. This gives us a centralized repository for features that incentivises reusability, while we manage feature code like any other production-grade code.
2. CI/CD takes care of updating feature metadata in our central feature store, spawning Spritz jobs to perform the transformations.
3. Spritz jobs write data to the feature store, making features accessible to services for model serving and to developers for training or analytics purposes.
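To give a feel for the first step, a feature definition in such a repository could look something like this (a hypothetical illustration; the decorator and argument names are invented, not Spritz's real interface):

```python
# features/user_spending.py -- hypothetical layout of the features repo

REGISTRY = {}  # in the real framework, CI/CD would sync this metadata

def feature(name: str, freshness: str):
    """Register a transformation as a named feature with a freshness target."""
    def decorator(fn):
        REGISTRY[name] = {"freshness": freshness, "transform": fn}
        return fn
    return decorator

@feature(name="total_spent_7d", freshness="1d")   # batch: Airflow territory
def total_spent_7d(transactions: list) -> float:
    return sum(t["amount"] for t in transactions)

@feature(name="txn_count_1h", freshness="10s")    # streaming: Spark + Kafka
def txn_count_1h(transactions: list) -> int:
    return len(transactions)

print(sorted(REGISTRY))  # ['total_spent_7d', 'txn_count_1h']
```

The freshness target is what lets the framework decide whether a feature can be served by the batch path or needs the streaming one.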
That’s all, really! We added some sparkle to our feature engineering game with Spritz and we are already collecting the fruits of our choices:
1. Spritz enabled new use cases requiring features computed in response to events describing a change of state for an object (e.g. a new bank transaction for a user). This is a class of solutions we weren't able to support before, with feature freshness in the order of a few seconds.
2. Thanks to single-digit-millisecond feature retrieval from the online store, we were able to speed up one of our services by 10x, with room for more improvements!