Conversational AI Avatar

Building a real-time conversational AI avatar

When a leading brand experience agency envisioned a real-time conversational AI avatar to embody its client’s product brand, it turned to us. The goal was an avatar that could engage users and answer their questions instantly. The twist: the avatar would be an audio chatbot aiming for human-like response latency, so that the conversation feels natural. This initial phase served as a proof of concept, demonstrating that the avatar can be integrated into a website and provide real-time, intelligent responses to user inquiries.

Expertise 

Product discovery 

Backend development

Frontend development

Interactive experience

AI

Industry

Data Management

Timeline

4 weeks

Technology

Next.js

WebRTC

Anthropic's Claude Haiku

Google Text-to-Speech

ElevenLabs

Vercel

AWS

Challenge

Our team faced three main challenges during the project. The foremost was reducing response latency so that conversing with the avatar felt as natural as talking to a real person. The second was establishing smooth audio stream input and output between the backend and frontend. The third was integrating a 3D avatar with moving parts such as a mouth and eyes.

Solution

In just three weeks, our team turned the concept into a working proof of concept and reached the following milestones.

We implemented a real-time Speech-To-Text transcription system that transcribes spoken words with a delay of around 1 second. This quick turnaround keeps the conversation with the avatar smooth and natural.

To keep the avatar responsive, we combined Claude 3 Haiku, a model optimized for speed and for tasks like instant customer support, with GPT-4 for its advanced reasoning capabilities. This combination allowed the avatar to give quick and intelligent answers, making the interaction more engaging and informative.

We achieved real-time Text-To-Speech audio streaming with a delay of approximately 0.5 seconds. This response time was crucial to maintaining the conversational flow, so that users felt as though they were speaking with a live person.

By introducing a "filler sentence" approach (described in the process section), we bridged the gap while the system processed user queries. This tactic brought the perceived overall latency down to around 1.5 seconds, making the avatar's responses seem more immediate.

What’s Next?

The client is pleased with our POC results, so we are continuing the project. The next challenges on our roadmap are:

While effective, filler sentences can't be used constantly. We are exploring additional UX strategies to further reduce perceived latency.

We are working on adding conversation memory, knowledge grounding, and moderation capabilities to provide more contextually relevant and grounded responses.

Identifying a scalable and resilient audio streaming platform that ensures high audio quality remains our priority.

Sounds interesting? Stay tuned: we will soon publish the next case study with the final project outcome!

Process

The journey began with a detailed discussion with the client to define the scope and objectives of the proof of concept (POC). Our dedicated team consisted of two backend AI developers, a project manager, and ad hoc frontend developers. With just three weeks to deliver, we were tasked with creating a functional product featuring a 2D, non-animated avatar capable of responding to questions with audio output.

From the outset, the backend developers took on the complex task of implementing a streaming architecture that would enable real-time processing of audio inputs and text outputs. They selected and integrated the fastest components available for each phase of the pipeline: speech-to-text, language model, and text-to-speech. They also continuously monitored and evaluated new component releases, ready to switch to better-performing alternatives as they became available.
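
One common way to let text-to-speech start before the language model has finished is to group streamed tokens into sentences and pass each sentence on as soon as it is complete. The sketch below illustrates that idea only; it is not the team's implementation. The function names (fake_llm_tokens, sentences, collect) and the hard-coded token list are hypothetical stand-ins for a real streaming model response.

```python
import asyncio
from typing import AsyncIterator

SENTENCE_END = (".", "!", "?")


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming language-model response."""
    for token in ["Hello", " there", ".", " How", " can", " I", " help", "?"]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so speech synthesis can start early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


async def collect() -> list[str]:
    """Run the pipeline and gather the sentences a TTS stage would receive."""
    return [s async for s in sentences(fake_llm_tokens())]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

In a real pipeline each yielded sentence would be handed to the text-to-speech component immediately instead of being collected into a list.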

To reduce perceived latency, the backend team devised a "short filler response" approach. By streaming a pre-recorded audio snippet (such as “That’s an interesting question” or “Hmm, let me think”) while querying the language model, they created the illusion of a more immediate response.
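
The filler approach can be sketched with asyncio: the model query is started first, and the filler clip plays while the query is still running. This is a minimal illustration under stated assumptions, not the project's code. query_language_model, play_audio, respond and run_demo are hypothetical names, and the sleep durations are arbitrary placeholders rather than measured values.

```python
import asyncio

events: list[str] = []  # records the order in which things happen


async def query_language_model(question: str) -> str:
    """Hypothetical stand-in for the slow language-model call."""
    await asyncio.sleep(0.05)  # placeholder duration, not a measurement
    events.append("answer ready")
    return f"Answer to: {question}"


async def play_audio(label: str, seconds: float) -> None:
    """Hypothetical stand-in for streaming an audio clip to the client."""
    events.append(f"start {label}")
    await asyncio.sleep(seconds)
    events.append(f"end {label}")


async def respond(question: str) -> str:
    # Start the model query first so it runs while the filler plays.
    answer_task = asyncio.create_task(query_language_model(question))
    await play_audio("filler", 0.01)
    answer = await answer_task
    await play_audio("answer", 0.01)
    return answer


async def run_demo(question: str) -> tuple[str, list[str]]:
    """Run one exchange and return the answer plus the event log."""
    events.clear()
    answer = await respond(question)
    return answer, list(events)
```

Calling asyncio.run(run_demo("What is it?")) returns the answer together with an event log in which the filler starts before the model's answer is ready, which is the effect the approach relies on.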

Additionally, they handled communication and format conversion between the backend and frontend components to ensure seamless audio stream input and output. They did this by assuming a PCM audio output stream at a fixed sample rate (currently 16 kHz) and then implementing the necessary format conversions and communication protocols.
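
As an illustration of what such a format conversion can look like, the sketch below converts floating-point samples to PCM bytes and splits them into fixed-duration frames for streaming. Only the 16 kHz rate comes from the text above. The 16-bit little-endian mono sample format, the 20 ms frame size, and all function names are assumptions made for this example.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000  # rate stated in the text
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono samples
FRAME_MS = 20            # assumption: 20 ms frames


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm: bytes, frame_ms: int = FRAME_MS) -> list[bytes]:
    """Split a PCM byte string into fixed-duration frames for streaming."""
    frame_bytes = SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


def sine(seconds: float, freq_hz: float = 440.0) -> list[float]:
    """Generate a test tone as float samples at the pipeline sample rate."""
    n = int(SAMPLE_RATE_HZ * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ) for i in range(n)]
```

Under these assumptions a 20 ms frame holds 320 samples, or 640 bytes, so 0.1 seconds of audio becomes five frames.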

Meanwhile, the frontend developers played an equally vital role. They focused on capturing audio in the browser and sending it to the backend, ensuring that the captured sound was transmitted accurately and efficiently. They also set up authentication so that only authorized users could access the system.

What partners say about us

The code and the work were good quality and really what we were looking for. They were able to bring technical design thinking to the project. Project management was tight and I always knew what was happening.
Daniel Kiyoi
CEO & Founder Magic Dust
Apptension was flexible and professional. When I needed to quickly add capacity, it took a week or two at worst - often days. The cooperation enabled me to slowly scale up my own IT team, and the company was very helpful until the last moments of the transition.
Mateusz Oleksiuk
CEO LESS_
The technical creativity delivered by the team at Apptension was invaluable in the formation of our mixed reality start-up Hyper.
Nathan Sparshott
Co-Founder & CEO of Hyper
The SaaS Boilerplate Apptension built was a huge reason we were so successful, because all these little seemingly unrelated tasks and integrations needed to happen. If we had been working with anybody else, It would have probably taken months to do the same work.
Kwame Nyanning
CEO blkbx.
We needed their help to build the backend services and capabilities needed to deliver a production level on demand. Their ability to produce high-quality work with consistency while dealing with a new type of project was impressive.
Kelly O’Conor
Product Lead, Siberia
The project’s success has resulted in a long-term partnership for design and development. Apptension is a fantastic partner, who is willing to go above and beyond in order to deliver what the client needs.
Catarina Rocha
p(r)oud solutions
Apptension fared well in our project, working with our bespoke CMS and complex requirements. The designs were implemented well and the schedule was kept tight.
Christian Marc Schmidt
Partner at Schema