Designing Internal Tools for Evaluating AI-generated Chats

With AI-generated chat becoming more popular, people expect our conversations to be not just proficient, but also engaging and dynamic. Here's how our team tackles that challenge.

Designing internal tools for AI-generated chats, written by Jessica Lascar.


Cleo is an AI financial assistant app designed to help people manage their finances.

Through chat, users engage with Cleo on a range of topics including insights on their spending, saving tips, budgeting advice, and credit score building.

Cleo has been using AI to facilitate these conversations for over six years, well before the widespread adoption of sophisticated language models like ChatGPT.

In the earlier system, we classified what users were asking (known as intent) and then provided pre-written responses. However, these often fell short of natural conversations, sometimes resembling search results.
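To make the "classify, then reply from a script" pattern concrete, here is a minimal Python sketch. It is an illustration only: the intent names, keywords, and replies are invented, and a naive keyword match stands in for whatever classifier the earlier system actually used.

```python
from typing import Optional

# Hypothetical intents and pre-written replies (not Cleo's real data).
CANNED_RESPONSES = {
    "spending": "Here's a breakdown of your spending this month.",
    "saving": "Here are a few ways you could save more.",
    "budget": "Let's set up a budget together.",
}
FALLBACK = "Sorry, I didn't catch that."

# Toy keyword lists standing in for a trained intent classifier (assumption).
KEYWORDS = {
    "spending": ("spend", "spent"),
    "saving": ("save", "saving"),
    "budget": ("budget",),
}


def classify_intent(message: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the message, else None."""
    text = message.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return None


def respond(message: str) -> str:
    """Look up the pre-written reply for the classified intent."""
    intent = classify_intent(message)
    return CANNED_RESPONSES.get(intent, FALLBACK)


print(respond("How much did I spend on coffee?"))
print(respond("Tell me a joke"))
```

Every message that maps to the same intent gets the same reply, whatever its wording, which is one reason this style of chat can feel like a search result rather than a conversation.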

Now, with AI-generated chat becoming more popular, people expect our conversations to be not just proficient, but also engaging and dynamic. Meeting that expectation depends on good internal tools.

The importance of internal tools

Internal tools are the foundation of our mission to create exceptional AI-generated chat interactions. They provide the essential space for our prompt engineers to craft prompts that effectively guide AI responses.

As our ambitions grew, it became evident that a single tool was insufficient. The complexity of creating and managing diverse prompts, assessing response quality, and ensuring a seamless user experience called for a dedicated Product Designer on our team.

Designing a set of internal tools that empower the creation of AI-generated chats presents a multifaceted challenge.

These tools play a pivotal role in ensuring that AI, such as GPT-4, understands user intent and delivers responses that are not only accurate but also engaging and contextually relevant.

Here's why they matter:

Alignment with User Expectations: Users have higher expectations of AI-driven conversations today. They anticipate responses that are not just informative but also engaging, conversational, and tailored to their needs. Designing tools that aid in prompt creation allows us to align our AI interactions with these evolving expectations.

Precision in Communication: Crafting prompts that effectively convey user queries is essential. These tools help writers fine-tune prompts to ensure that the AI comprehends the nuances of user questions, leading to more accurate responses.

Optimising AI Output: AI models like GPT-4 are powerful, but they require clear instructions to provide the desired output. Tools for prompt creation enable us to give precise guidance to the AI, resulting in responses that are on-point and user-centric.

Maintaining Brand Voice: For brands like Cleo, maintaining a consistent brand voice is crucial. Internal tools help writers infuse the right tone and personality into prompts, ensuring that AI-generated responses reflect the brand's identity.

Efficiency and Scalability: As user interactions grow, internal tools become indispensable for efficiency and scalability. They enable prompt engineering at scale, meeting the demands of a growing user base without compromising quality.

The Challenge of Maintaining Quality

AI-generated responses can impress with their precision and even inject a touch of humour when done right, all while remaining contextual to the conversation.

An example of Cleo being witty. We love it.

However, there are times when they produce unintended or unhelpful outputs, as in the example below:

I mean, we'd love a longer summer.

Hallucinating, much?

GPT doesn't get it right every time.

This highlights a critical issue: how do we ensure the quality of responses generated by AI, especially when the stakes are high, such as in customer support or informative interactions? 

The answer lies in harnessing human expertise to evaluate and improve these responses.

Empowering Human Oversight

To address this challenge, we have created a tool that allows human evaluators to rate the quality of each AI-generated response. This human oversight is pivotal in maintaining the high standards of our chat interactions.

Let's take a look at the evolution of this tool:

From Spreadsheet to Streamlined Interface

We started with a rudimentary system, depicted here as a simple spreadsheet:

While it served its purpose, we recognized the need for a more sophisticated and user-friendly interface. The journey from the spreadsheet to our new tool marks a significant step forward in assessing response quality:

At the top, annotators can see the user request, followed by a two-column comparison: GPT response versus Cleo response.

This side-by-side comparison allows evaluators to make a direct assessment of which response serves the user better.

In its initial iteration, the tool relied on a simplistic "thumbs up" and "thumbs down" rating system, which boiled down to a binary "good response" or "bad response" assessment.

However, we quickly realized the need for a more nuanced approach.

How are we doing it now?

To enhance the evaluation process, we introduced additional criteria for rating responses. Evaluators now consider factors like utility, tone of voice, factual accuracy, and the provision of sensible actions.

We even included a final question: "Which one is the better response?"
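One way to picture what a single evaluation might capture is the Python sketch below. It is not the tool's real schema: the field names, the 1 to 5 scale, and the `summarise` helper are all assumptions made for illustration. Only the four criteria, the "better response" question, and the notes field come from the process described in this post.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean

# The four criteria named above; the identifiers themselves are hypothetical.
CRITERIA = ("utility", "tone_of_voice", "factual_accuracy", "sensible_actions")
SIDES = ("gpt", "cleo")


@dataclass
class Evaluation:
    """One annotator's judgement of a GPT response versus a Cleo response."""
    user_request: str
    gpt_scores: dict   # criterion -> score from 1 to 5 (scale is an assumption)
    cleo_scores: dict  # same shape as gpt_scores
    better: str        # answer to "Which one is the better response?"
    notes: str = ""    # free-text notes from the annotator


def summarise(evaluations):
    """Average each criterion per side and compute how often each side wins."""
    summary = {}
    for side in SIDES:
        summary[side] = {
            criterion: mean(getattr(e, f"{side}_scores")[criterion] for e in evaluations)
            for criterion in CRITERIA
        }
    wins = Counter(e.better for e in evaluations)
    summary["win_rate"] = {side: wins[side] / len(evaluations) for side in SIDES}
    return summary


# Made-up ratings, purely to show the shape of the data.
evaluations = [
    Evaluation(
        user_request="Can I afford a holiday?",
        gpt_scores={"utility": 4, "tone_of_voice": 5, "factual_accuracy": 3, "sensible_actions": 4},
        cleo_scores={"utility": 3, "tone_of_voice": 3, "factual_accuracy": 5, "sensible_actions": 3},
        better="gpt",
        notes="GPT reply is warmer but misstates one figure.",
    ),
    Evaluation(
        user_request="How do I build my credit score?",
        gpt_scores={"utility": 2, "tone_of_voice": 3, "factual_accuracy": 3, "sensible_actions": 2},
        cleo_scores={"utility": 5, "tone_of_voice": 3, "factual_accuracy": 5, "sensible_actions": 5},
        better="cleo",
    ),
]

print(summarise(evaluations))
```

Keeping per-criterion scores alongside the overall preference makes it possible to see why one response won, which a single thumbs up or thumbs down cannot show.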

This revamped interface empowers our human evaluators to efficiently review AI-generated responses, providing ratings and feedback that play a pivotal role in refining and optimizing our chat interactions. The result is a system that not only streamlines evaluation but also significantly elevates the overall quality of our responses.

Room for Continuous Improvement

As we continue to refine our tools and processes, we acknowledge that there is always room for improvement. The journey toward perfecting AI-generated chat responses is ongoing, and we remain committed to staying at the forefront of AI technology.

Our goal is to ensure that every interaction with Cleo leaves users not only impressed with the intelligence of our chat but also satisfied with the quality of the conversation.

We even added a notes section. Neat.


Designing tools for AI-generated chats is like embarking on an exciting adventure into uncharted territory. There are very few tools out there that have been carefully designed for this purpose. It's a journey that keeps evolving, and we’re constantly refining and adapting our tools to make them better.

As we continue on this exciting path, our enthusiasm is matched by our determination to lead in AI-driven conversations. We place a strong emphasis on design, constantly pushing the boundaries of technology. Our commitment is to craft chat experiences that are not just intelligent but also beautifully designed for our users.

haha. hope you laughed like we did.
