Cleo is a chat-native company. Back before we built our app, she talked to users through Facebook Messenger. Chat is still a primary way users interact with Cleo’s features.
So what do users chat to Cleo about? Users get insights about their spending that are presented with humor, for example “roast mode”. They can check their credit scores, ask for a cash advance, or check how much they have left to spend before payday. Users can also get help with customer support issues, ask how to connect new accounts or payment methods, or set up a subscription.
For several years, the way we understood users’ questions was to first categorize the intent of the message, using a combination of machine learning and hand-built rules. The intent, along with the user’s state, fed into a decision tree to pick one of thousands of pre-written templates, which was then rendered into Cleo’s response.
However, Cleo’s messages from pre-written templates didn’t always sound like a conversation or refer directly to what the user had typed in. For some topics, chat worked more like a search engine for relevant information or links. Now, with AI chat on the rise, people have higher expectations of the conversation.
The benefits of large language models
Natural language understanding
Large language models (LLMs) can consider a large context window of previous conversation and use it to understand the most recent question, enabling them to understand phrases like, “How will that help me?” or, “But it’s not working”. And by fine-tuning on Cleo’s conversations, LLMs could understand our users even better.
Natural language generation
LLMs can generate responses that make sense in the context of the user’s question. Users can ask far more subtly different questions than we could ever write responses for.
Can we sub in GPT-4?
How can we use the latest advances in LLMs, such as GPT-4, to improve Cleo chat? Can we just sub in GPT-4 for Cleo to answer users’ questions?
It depends what users are asking for.
For some questions that people ask Cleo, yes, GPT-4 gives a great response. With some generic information in the prompt, GPT-4 can answer questions like, “Hi”, or “What can this bot do?”. It’s also not bad at answering general knowledge questions about personal finance topics.
For other questions, we don’t want GPT-4 to step in because it will give wrong information. Examples include “Can I get a refund”, “Transfer some $10 out of my wallet”, and “What stocks should I buy”. It’s happy to make up a new refund policy. It would tell customers that it had transferred some money, and then nothing would happen. And due to regulation, we don’t want to provide advice on investments.
There’s also an area in between, where GPT-4 can generate a great answer to the user’s question, but only if the prompt has the right information about Cleo and the user. Examples include “How do I connect my card” and “How much did I spend at restaurants last month?”
How can we actually use large language models?
When users type out questions to Cleo, we don’t send them straight to GPT-4. But we can use GPT-4 and other LLMs to help craft the response by doing something more nuanced:
Step 1: classify the questions that users type out according to whether we want GPT-4 to generate an answer, and if so, what category of information we need. This is intent classification, which we’ve already been doing, but with fine-tuned LLMs we can do even better.
We still need class-labeled training data just like any traditional classification task, but because we’re using pre-trained language models, we don’t need as many examples per intent.
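To make the output of Step 1 concrete, here is a minimal sketch of what a routing table keyed on the predicted intent might look like. The intent names, the info categories, and the `route_for` helper are all hypothetical, chosen to mirror the examples in this post rather than Cleo’s real taxonomy. The classifier itself (a fine-tuned language model) is out of scope here; the sketch only shows what happens to its prediction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    TEMPLATE = "template"  # always answer with a pre-written response
    GENERATE = "generate"  # let an LLM write the answer from retrieved info


@dataclass(frozen=True)
class Route:
    strategy: Strategy
    # Which kind of information to retrieve before generating (None for templates).
    info_category: Optional[str] = None


# Hypothetical intents and routes, for illustration only.
ROUTES = {
    "greeting": Route(Strategy.GENERATE, "generic_product_info"),
    "connect_card": Route(Strategy.GENERATE, "help_articles"),
    "spending_query": Route(Strategy.GENERATE, "user_transactions"),
    "refund_request": Route(Strategy.TEMPLATE),
    "transfer_money": Route(Strategy.TEMPLATE),
    "investment_advice": Route(Strategy.TEMPLATE),
    "roast_me": Route(Strategy.TEMPLATE),
}

# Unknown intents fall back to the safe option: no free-form generation.
FALLBACK = Route(Strategy.TEMPLATE)


def route_for(intent: str) -> Route:
    """Map a predicted intent label to a response strategy."""
    return ROUTES.get(intent, FALLBACK)
```

Defaulting unknown intents to a template is one possible design choice: it means a misclassified or brand-new kind of question never reaches the generative model by accident.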
Step 2: retrieve the relevant information. Some of the retrieval might just be getting a relevant fact from our database or from other ML endpoints. For customer service questions, we’re using semantic similarity search to find the most relevant paragraphs from our lengthy customer service knowledge base. LLMs can also help with this step by generating embeddings for semantic similarity search. More on that in another blog post, soon.
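As a rough illustration of semantic similarity search, the sketch below ranks knowledge-base paragraphs by cosine similarity between their embedding and the embedding of the question. Everything in it is an assumption made for the example: the three-dimensional vectors are hand-made toy values standing in for real model embeddings, the paragraph texts are placeholders, and `cosine_similarity` and `top_k` are hypothetical helper names. A production system would typically use an embedding model and a vector index instead of a linear scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k paragraph texts whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Placeholder paragraphs with toy embeddings (not real model output).
KNOWLEDGE_BASE = [
    ("Placeholder paragraph about connecting a card.", [0.9, 0.1, 0.0]),
    ("Placeholder paragraph about subscriptions.", [0.1, 0.9, 0.0]),
    ("Placeholder paragraph about cash advances.", [0.0, 0.2, 0.9]),
]

# Toy embedding for a question like "How do I connect my card".
query = [0.8, 0.2, 0.1]
best = top_k(query, KNOWLEDGE_BASE, k=1)
```

With these toy vectors, `best` contains only the card-connection paragraph, because its embedding points in nearly the same direction as the query.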
Step 3: craft the response. For some intents, we’d always use a pre-written response, because we’re happy with how Cleo responds and using a generative model would risk messing it up. This includes “verbal buttons” like “roast me” where the user is giving a specific command or activating an Easter Egg. With other intents, we want to condense several paragraphs of possibly-relevant information into a short response that makes sense. This is where the latest instruction-following models like GPT-4 come in.
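Step 3 can be sketched as a small function that either returns a pre-written template or builds a prompt from the retrieved paragraphs and hands it to a generative model. This is only an illustration under stated assumptions: the `TEMPLATES` table, the prompt wording, and the `craft_response` and `build_prompt` names are hypothetical, and the model is passed in as a plain callable so that no particular provider’s API is implied.

```python
from typing import Callable, Dict, List

# Hypothetical pre-written responses for intents that should never be generated.
TEMPLATES: Dict[str, str] = {
    "roast_me": "<pre-written roast goes here>",
}


def build_prompt(question: str, context_paragraphs: List[str]) -> str:
    """Combine retrieved paragraphs and the user's question into one prompt."""
    context = "\n\n".join(context_paragraphs)
    return (
        "Answer the user's question using only the context below. "
        "Keep the answer short.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def craft_response(
    intent: str,
    question: str,
    context_paragraphs: List[str],
    generate: Callable[[str], str],
) -> str:
    """Use a template when one exists for the intent, otherwise generate."""
    if intent in TEMPLATES:
        return TEMPLATES[intent]
    return generate(build_prompt(question, context_paragraphs))
```

Passing `generate` in as an argument keeps the template-versus-generation decision separate from the choice of model, which makes it easier to swap providers or to test the logic with a stub.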
These three steps aren’t a vision statement. They’re what we’ve actually been able to put into production and A/B test within a few short months.
Actions: Users ask Cleo to take actions, not just answer questions. A chatbot that can ace a pub quiz can’t sub in for Cleo much of the time. Even worse, it might lie by saying it’s taking an action, and then not follow through with it.
Privacy: We wouldn’t send Cleo conversations to any old third-party API. Partners have to pass our due diligence process and security requirements. This limits how many different proprietary models we can try, but it’s crucial for our users’ privacy.
Latency: The response time of a third-party API can fluctuate with no warning or explanation. So, we have to remain open to switching to different models or providers to find better latency, or self-hosting models so that we have more control.
Tone of voice: We have a great corpus of hand-written messages with Cleo’s tone of voice, crafted by our team of writers and comedians. The instruction-following GPT models we’re using to generate responses in Step 3 above cannot be fine-tuned, which risks making the chatbot more bland. We don’t like bland.
Large language models are bringing a step change to the quality of Cleo’s chat. There are many directions in which we could improve the product. Here are some we have in mind:
- Enabling users to interrogate their data with a wider variety of questions about their spending history
- Remembering previous conversations to help users stay on track with their goals, including generating notification messages
- Serving buttons for actions that the user can take, which make sense alongside generated text
If this sounds exciting, we’re currently looking for NLP Data Scientists to join our team.
*OpenAI is not used in conversations related to the credit builder card, issued by WebBank.