
A Year of LLM Developments at Cleo

2023 was a big year for LLMs and generative AI. It was also a big year for Cleo. Let’s break it down by season.


Nina, our Senior Content Designer and Chat Experience Lead, gives us a rundown of how we made Cleo smarter by harnessing the latest developments in LLMs throughout the year. 

Why LLMs?

Our aim is for users to be able to chat with Cleo about all things finance, and come away feeling more confident about handling their money. With the latest developments in LLMs, Cleo has become much smarter, much more quickly. LLMs let Cleo better understand context – that is, what topic you were just chatting about along with information like your budget status, what subscription products you’re on, and more. It all helps Cleo see the bigger picture when it comes to helping you with financial health – and we’re only just getting started.

Here’s a breakdown by season of how we innovated upon our existing chat technology.


At Cleo, we’ve been building our own LLMs for the past 7 years. When ChatGPT launched to the public in winter 2022, our data scientists started experimenting with it to see which models could work with Cleo.

Around this time, we started building out an internal chat squad. At first, this was a bunch of data scientists plus a couple of backend engineers. Eventually, we brought in a conversation designer, a product designer and user researchers to form a full squad focused on Cleo’s chat experience. 

Next: Operation Get Cleo Employees Hyped. Our Firebreak week theme for Winter 2022 was ChatGPT integration. Everyone in the company got their heads together to come up with different ways to implement ChatGPT within the Cleo app.


In the Spring, we started an annotations project in-house. We assessed what users were typing to Cleo in the app and checked the categorization each request had been given. As part of this project, we were able to suggest new user intents, helping Cleo give more relevant, specific and personalized responses to users, enhancing the user experience. This project marked our first step towards creating responses for absolutely anything that someone could ask Cleo. And it went well! We saw a 62% reduction in intent recognition errors off the back of this project. We’d created chat functionality that could understand a user’s intent, with wider context.

We realized that the best way to clarify intent is sometimes just to ask a user what they meant. So we implemented our first GPT use case: no_command_found. Essentially, this uses an LLM to ask users to clarify what they’re asking or talking about when we can’t identify their intent. Adding in this triage point really helped the user experience, as users could get to exactly what they were looking for.
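
The flow behind a fallback like this can be sketched in a few lines. This is a minimal illustration, not Cleo’s actual code: the classifier, the confidence threshold, and the canned clarification message are all invented stand-ins.

```python
# Minimal sketch of a no_command_found-style triage point. The classifier,
# threshold, and canned clarification are invented stand-ins; in production
# the clarifying question would be generated by an LLM from the conversation.

CONFIDENCE_THRESHOLD = 0.7

def classify_intent(message: str) -> tuple[str, float]:
    """Toy intent classifier: returns (intent, confidence)."""
    known = {"balance": "check_balance", "budget": "show_budget"}
    for keyword, intent in known.items():
        if keyword in message.lower():
            return intent, 0.9
    return "no_command_found", 0.2

def respond(message: str) -> str:
    intent, confidence = classify_intent(message)
    if intent == "no_command_found" or confidence < CONFIDENCE_THRESHOLD:
        # This is where an LLM would generate a clarifying question.
        return "I'm not sure what you mean - could you rephrase that?"
    return f"Handling intent: {intent}"
```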

We also launched gpt_everywhere with a small population of users. This involved plugging LLMs into a small selection of chat interactions with Cleo. The LLM looks at what Cleo and the user have chatted about, and assesses our pre-written responses (mentioned above). Using all of this information, Cleo is then able to provide a response that accounts for the context of her recent conversation with a user.
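
As a rough sketch of how such a prompt might be assembled (the helper name and the prompt wording are invented for illustration, not Cleo’s real prompt):

```python
# Rough sketch of a gpt_everywhere-style prompt: the recent conversation
# plus pre-written candidate responses are handed to an LLM, which picks
# or adapts the best fit. format_context is invented for this example.

def format_context(history: list[tuple[str, str]], candidates: list[str]) -> str:
    """Assemble conversation history and candidate responses into one prompt."""
    turns = [f"{speaker}: {text}" for speaker, text in history]
    numbered = [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    return (
        "Conversation so far:\n" + "\n".join(turns)
        + "\n\nPre-written responses:\n" + "\n".join(numbered)
        + "\n\nRewrite the most relevant response so it fits the conversation."
    )
```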

One of the biggest things we implemented in the Spring was introducing our FAQ service into Cleo’s chat. We essentially brought our Intercom help center into Cleo’s chat by using FAQ articles for support. We use the FAQ articles in prompts for LLMs, so Cleo’s responses to a user are up to date when it comes to features, policies etc. This took a lot of work, especially when it came to brushing up FAQ articles to be readable by LLMs.
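
A minimal sketch of the idea, assuming a simple keyword-overlap retrieval (the articles here are invented, and production retrieval would typically use embeddings rather than word overlap):

```python
# Illustrative sketch of grounding answers in FAQ articles. The articles
# and the overlap scoring are invented for this example.

FAQ_ARTICLES = {
    "cash_advance": (
        "Cash advances are available to eligible subscribers. Funds "
        "usually arrive within a few days, or faster for a fee."
    ),
    "cancel_subscription": (
        "You can cancel your subscription at any time from the settings "
        "screen. Changes apply at your next billing date."
    ),
}

def retrieve_faqs(question: str, top_k: int = 1) -> list[str]:
    """Rank articles by naive word overlap with the question."""
    words = set(question.lower().split())
    scored = sorted(
        FAQ_ARTICLES.values(),
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question: str) -> str:
    """Put the retrieved articles into the prompt so answers stay current."""
    context = "\n\n".join(retrieve_faqs(question))
    return (
        "Answer the user using only the FAQ articles below.\n\n"
        f"{context}\n\nUser question: {question}"
    )
```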

Through these projects, we were able to see a 9% reduction in customer support contact rate and a 2.5x increase in unique requests to Cleo, which means people were typing more and more unique things to converse with Cleo. This was very exciting stuff for our team.

We quickly recognized that we needed to build a prompt builder and testing tool in-house. Building this tool took a big effort from the internal design team, and we now iterate on it regularly.


Summer came around, and we kicked off yet another project. 

This time it was around GPT annotations. We looked at the responses that CleoGPT sent to users, and used a set of evaluators (looking at things like quality, tone of voice and accuracy) to judge each GPT-generated response. This was important as it allowed us to pinpoint exactly how GPT responses could improve. Again, this project brought in people from across the company and involved more internal tool building. At one point, we were doing 1,500 annotations per week, all balanced with squad work too.

Because gpt_everywhere was more generic, we wanted to start experimenting with intent-specific prompts. We could then put different information into prompts, depending on the intent, to ensure we meet user needs. We found that doing this made Cleo more empathetic to a user’s specific needs, which is obviously something we love to see. 
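
A toy illustration of the template-per-intent idea (the intents, wording and context fields are all invented for this example):

```python
# Toy intent-specific prompt templates. Intents, wording, and context
# fields are invented for illustration, not Cleo's real prompts.
PROMPT_TEMPLATES = {
    "budget_help": (
        "You are Cleo, a friendly money assistant. The user's monthly "
        "budget status is: {budget_status}. Help them stay on track."
    ),
    "cash_advance": (
        "You are Cleo. The user is asking about cash advances. Their "
        "eligibility is: {eligibility}. Be clear about terms and timing."
    ),
}
DEFAULT_TEMPLATE = "You are Cleo. Answer the user helpfully."

def build_intent_prompt(intent: str, **context: str) -> str:
    """Pick the template for this intent and fill in its context."""
    template = PROMPT_TEMPLATES.get(intent, DEFAULT_TEMPLATE)
    return template.format(**context)
```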

In the summer, we launched our second internal chat squad. They’re focused on spending insights, specifically how we play a user’s spending habits back to them in a useful way, using GPT to help.


Well, Autumn came around quick!

Our experimentation with fine-tuning GPT on a smaller model began. So what is a fine-tuned GPT model? It’s when we take an earlier model, like GPT-3.5, and train it for a specific task (like talking to people about cash advances). Basically, instead of using a one-size-fits-all hat (GPT-4, in this case) for everything, we used a series of more specific hats for more specific tasks (fine-tuned GPT-3.5). By using a less advanced and cheaper LLM, we saved a bunch of money, freeing up resources to develop other things. But it’s not over. This is an ongoing project which involves lots of collaboration between content design and data scientists.
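
For a sense of what fine-tuning involves in practice: OpenAI’s chat models are fine-tuned from a JSONL file with one conversation per line. The example conversations below are invented, not Cleo’s real training data.

```python
import json

# Illustration only: fine-tuning data for OpenAI chat models is a JSONL
# file with one {"messages": [...]} conversation per line. The example
# conversations are invented.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Cleo, a friendly money assistant."},
            {"role": "user", "content": "Can I get a cash advance?"},
            {"role": "assistant", "content": "Let's check your eligibility first."},
        ]
    },
]

def to_jsonl(records: list[dict]) -> str:
    """Serialize one training example per line, as the fine-tuning API expects."""
    return "\n".join(json.dumps(r) for r in records)
```

The resulting file would then be uploaded and a training job started with the OpenAI client, e.g. client.fine_tuning.jobs.create(training_file=file_id, model="gpt-3.5-turbo").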

With the introduction of different models in the same app, we needed a way to properly test how different versions of prompts behave. We wanted to find out how outputs would differ if we changed the copy, the ‘temperature’ (of Cleo’s tone of voice), or the model we were using. So we built yet another tool in-house, this time an A/B testing tool, which allows us to test and act on hypotheses quickly.
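
One common building block for this kind of testing is deterministic bucketing: hash each user into a prompt variant so they always see the same one. A sketch, with invented variant names and settings:

```python
import hashlib

# Sketch of deterministic A/B bucketing for prompt variants.
# Variant names, prompts, and temperatures are invented.
VARIANTS = {
    "control": {"prompt": "You are Cleo.", "temperature": 0.7},
    "warmer": {"prompt": "You are Cleo, warm and upbeat.", "temperature": 0.9},
}

def assign_variant(user_id: str, experiment: str) -> str:
    """Hash user + experiment so each user always sees the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    names = sorted(VARIANTS)
    return names[int(digest, 16) % len(names)]
```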

We’re taking this another step further by training LLMs to evaluate outputs from other LLMs, in something called ‘automated evaluation’. For example, we have a strictly trained LLM looking at chat responses, specifically assessing quality. This lets us ‘simulate’ an interaction between Cleo and a user at scale, essentially creating thousands of versions of the same interaction and using an LLM to evaluate how frequently certain issues might pop up, like incorrect style or a too-intense tone of voice.
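
A stripped-down sketch of the aggregation step, using a deterministic stand-in for the judge model (the issue labels and checks are invented):

```python
# Stripped-down sketch of automated evaluation. judge_reply is a
# deterministic stand-in for an evaluator LLM; the issue labels and
# checks are invented for illustration.

def judge_reply(reply: str) -> list[str]:
    """Return issue labels for one generated reply (stand-in judge)."""
    issues = []
    if reply.isupper():
        issues.append("too_intense_tone")
    if len(reply) > 200:
        issues.append("too_long")
    return issues

def issue_rates(replies: list[str]) -> dict[str, float]:
    """Aggregate how often each issue appears across simulated replies."""
    counts: dict[str, int] = {}
    for reply in replies:
        for issue in judge_reply(reply):
            counts[issue] = counts.get(issue, 0) + 1
    return {issue: n / len(replies) for issue, n in counts.items()}
```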

What’s Next?

We’re still building out tooling for each step of this experimentation process, and developments will take time. The ultimate goal when building in-house tooling is to make every tool user-friendly, with minimal onboarding needed.

We plan to continue with experimentation, following our instincts to do the right thing. We’ve learned so much by collaborating as a team of product designers, data scientists, analysts, engineers, product managers, user researchers, and content designers. It feels great to know we can rely on our awesome Cleo colleagues to build the tools we need, and create more great outcomes for our users.

Helping users build their financial health is getting easier and better with the use of LLMs. 
