What's the big deal about open source AI?
A quick, non-technical look at open AI models and why they matter so much
Cleo is on a mission to improve the world’s financial health by offering personalized insights and tools to help users reach their goals – all of which rely on a solid understanding of income.
To accomplish this, we built a model that predicts how likely an individual transaction is to be earned income. The good news is that we had a lot of transaction data to train a model with, but we faced a challenge – our transaction data lacked any labels.
We’re lucky to have the option of getting labels by annotating the data with an internal tool, but annotation is slow and therefore expensive. So before committing to scaling up annotations, we decided to build a proof of concept to test our hypothesis that income classification was even possible with this data, and hence worth investing time in.
In some instances, identifying income is straightforward. Transaction descriptions that include 'payroll' are earned income (duh).
We began by crafting simple rules to label these obvious cases, providing a foundation upon which to build. However, the real world is rarely so straightforward.
Think about that $42 Walmart transaction – is it a paycheck or just a refund? These grey areas need more advanced techniques.
To capture some of this real world messiness, we used third-party categories as an extra feature to decide the labels. Now, we could train a model on the large dataset we generated from the unlabelled data.
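To make the labelling idea concrete, here is a minimal sketch of how rules and third-party categories might be combined. It is not our production code: the keyword lists, the category names and the `weak_label` function are all hypothetical, and treating the third-party category as a fallback after the rules is just one possible way to combine the two signals.

```python
from typing import Optional

# Hypothetical keyword lists and category names, for illustration only.
INCOME_KEYWORDS = ("payroll", "salary")
NON_INCOME_KEYWORDS = ("refund", "reversal")
INCOME_CATEGORIES = {"Paycheck"}
NON_INCOME_CATEGORIES = {"Refund", "Transfer"}


def weak_label(description: str,
               third_party_category: Optional[str] = None) -> Optional[int]:
    """Return 1 (earned income), 0 (not income) or None (no label)."""
    text = description.lower()

    # Step 1: the obvious cases, decided by simple keyword rules.
    if any(keyword in text for keyword in INCOME_KEYWORDS):
        return 1
    if any(keyword in text for keyword in NON_INCOME_KEYWORDS):
        return 0

    # Step 2: fall back to the (imperfect) third-party category.
    if third_party_category in INCOME_CATEGORIES:
        return 1
    if third_party_category in NON_INCOME_CATEGORIES:
        return 0

    # Grey area: leave the transaction unlabelled rather than guess.
    return None


transactions = [
    ("ACME CORP PAYROLL", None),
    ("WALMART 42.00", "Refund"),
    ("WALMART 42.00", None),
]
labels = [weak_label(desc, cat) for desc, cat in transactions]
```

In this sketch the first transaction is labelled by a rule, the second by its category, and the third stays unlabelled, so it would simply be left out of the generated training set.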
Because the input to our model is text, it's a perfect use case for a BERT-based model (BERT stands for Bidirectional Encoder Representations from Transformers) with a classification head.
While it was a step forward, we knew that our ‘easy cases’ heuristics could introduce bias and that the third-party labels definitely had their flaws.
We wanted to do better because we’re big-brained like that 💅
This is where human annotations stepped in as a valuable source of high-quality training data. However, to save time and cost, we only collected a small sample.
So, how do we maximize their impact?
Our answer to this was to take the model we built earlier on the heuristic labels and evaluate it against the human annotations.
We could then inspect false positives and false negatives, spot common misclassifications, and refine the heuristic rules to improve label quality. This iterative process, known as bootstrapping, is the secret sauce.
We retained a holdout test set of the annotations (the real ground truth) to ensure our final model evaluation remained unbiased.
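One round of that loop can be sketched roughly as follows. This is an illustration under assumptions, not our actual pipeline: `split_annotations`, `collect_errors`, the 30% holdout fraction and the stand-in `predict` function are all hypothetical, and a real run would call the trained model instead.

```python
import random
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[str, int]  # (transaction description, human label)


def split_annotations(annotations: List[Annotation],
                      holdout_fraction: float = 0.3,
                      seed: int = 0) -> Tuple[List[Annotation], List[Annotation]]:
    """Shuffle the annotations and set aside a holdout set that is never inspected."""
    shuffled = list(annotations)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (dev set, holdout set)


def collect_errors(predict: Callable[[str], int],
                   dev_set: List[Annotation]) -> Dict[str, List[str]]:
    """Group the model's mistakes on the dev set so the rules can be refined."""
    errors: Dict[str, List[str]] = {"false_positives": [], "false_negatives": []}
    for description, human_label in dev_set:
        predicted = predict(description)
        if predicted == 1 and human_label == 0:
            errors["false_positives"].append(description)
        elif predicted == 0 and human_label == 1:
            errors["false_negatives"].append(description)
    return errors


def predict(description: str) -> int:
    """Stand-in for the trained model: flags anything mentioning payroll."""
    return 1 if "payroll" in description.lower() else 0


annotations = [
    ("ACME PAYROLL", 1),
    ("PAYROLL REVERSAL", 0),
    ("GIG PLATFORM PAYOUT", 1),
    ("COFFEE SHOP", 0),
]
errors = collect_errors(predict, annotations)
```

Here the stand-in model wrongly flags the payroll reversal and misses the gig payout, which is exactly the kind of pattern that would prompt a new or tightened rule. Only the dev portion returned by `split_annotations` should be inspected this way; the holdout portion is reserved for the final evaluation.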
So, that’s how we transformed a large, unlabelled dataset into a labelled one, using a relatively small number of annotations.
The benefits were twofold: we created a valuable resource for training our machine learning model, and we gained a deeper understanding of our income classification task by working with the annotators to clarify labelling rules around edge cases.