Expertise
Product discovery
Backend development
Frontend development
Interactive experience
AI
Industry
Data Management
Timeline
4 weeks
Technology
Next.js
WebRTC
Anthropic's Claude 3 Haiku
Google Text-to-Speech
ElevenLabs
Vercel
AWS
Challenges

Our team faced several significant challenges during the project. The foremost was reducing response latency so that conversing with the avatar felt as natural as talking to a real person. Establishing smooth audio-stream input and output between the backend and frontend also proved difficult. Finally, integrating a 3D avatar with moving parts such as a mouth and eyes presented its own set of difficulties.
The Process

The journey began with a detailed discussion with the client to define the scope and objectives of the proof of concept (POC). Our dedicated team consisted of two backend AI developers, a project manager, and frontend developers brought in ad hoc. With just three weeks to deliver, we were tasked with creating a functional product featuring a 2D, non-animated avatar capable of responding to questions with audio output.
From the outset, the backend developers took on the complex task of implementing a streaming architecture that would enable real-time processing of audio inputs and text outputs. They carefully selected and integrated the fastest components for each phase of the pipeline—speech-to-text, language model, and text-to-speech. Their work didn’t stop there; they continuously monitored and evaluated new component releases, ready to switch to better-performing alternatives as they became available.
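To make that swap-ability concrete, here is a minimal TypeScript sketch of how such a pipeline can be wired together. The three stage interfaces are our own illustration, not the project's actual APIs; the point is that each stage consumes the previous stage's stream, so audio can start playing before the full response has been generated, and any stage can be replaced without touching the orchestration.

```typescript
// Minimal sketch of a swappable streaming pipeline. The three stage
// interfaces below are illustrative assumptions, not the project's APIs.

interface SpeechToText {
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<string>; // partial transcripts
}

interface LanguageModel {
  generate(prompt: string): AsyncIterable<string>; // streamed tokens
}

interface TextToSpeech {
  synthesize(text: AsyncIterable<string>): AsyncIterable<Uint8Array>; // PCM chunks
}

// Each stage consumes the previous stage's stream; swapping a vendor
// means swapping one interface implementation.
async function* pipeline(
  audioIn: AsyncIterable<Uint8Array>,
  stt: SpeechToText,
  llm: LanguageModel,
  tts: TextToSpeech,
): AsyncIterable<Uint8Array> {
  let transcript = "";
  for await (const partial of stt.transcribe(audioIn)) {
    transcript = partial; // keep the latest (most complete) transcript
  }
  yield* tts.synthesize(llm.generate(transcript));
}
```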
One of the innovative strategies the backend team devised to reduce perceived latency was a "short filler response" approach: streaming a pre-recorded audio snippet (such as “That’s an interesting question” or “Hmm, let me think”) while querying the language model, creating the illusion of a more immediate response.
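A minimal sketch of the idea, assuming hypothetical playAudio and fetchAnswerAudio helpers: the slow model round trip is started first but not awaited, and the filler clip masks the wait.

```typescript
// Sketch of the "short filler response" trick. playAudio and
// fetchAnswerAudio are hypothetical helpers, named for illustration.

const FILLER_CLIPS = [
  "fillers/interesting-question.pcm",
  "fillers/let-me-think.pcm",
];

async function respondWithFiller(
  userQuery: string,
  playAudio: (clip: string | Uint8Array) => Promise<void>,
  fetchAnswerAudio: (query: string) => Promise<Uint8Array>,
): Promise<void> {
  // Kick off the (slow) LLM + TTS round trip without awaiting it yet.
  const answerPromise = fetchAnswerAudio(userQuery);

  // Meanwhile, stream a random pre-recorded filler to mask the wait.
  const filler = FILLER_CLIPS[Math.floor(Math.random() * FILLER_CLIPS.length)];
  await playAudio(filler);

  // By the time the filler finishes, the real answer is usually ready.
  await playAudio(await answerPromise);
}
```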
Additionally, they handled the critical task of communication and format conversion between the backend and frontend to ensure seamless audio-stream input and output. This was achieved by standardizing the audio stream on PCM at a fixed sample rate (currently 16 kHz) and then implementing the necessary format conversions and communication protocols.
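For illustration, here is roughly what that conversion involves: browser audio arrives as 32-bit floats in [-1, 1] at the device sample rate (typically 44.1 or 48 kHz), while the contract above expects 16-bit PCM at 16 kHz. The sketch uses naive nearest-sample downsampling; a production version would low-pass filter first to avoid aliasing.

```typescript
// Float32 -> 16-bit PCM conversion, as the 16 kHz PCM contract assumes.
function floatTo16BitPcm(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to valid range
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale to int16
  }
  return out;
}

// Naive downsampling by picking every n-th sample; good enough for a
// sketch, but a real pipeline should low-pass filter before decimating.
function downsample(input: Float32Array, fromRate: number, toRate = 16000): Float32Array {
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = input[Math.floor(i * ratio)];
  }
  return out;
}
```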
Meanwhile, the frontend developers played an equally vital role in the project. They focused on gathering audio from the browser and sending it to the backend, ensuring that the captured sound was transmitted accurately and efficiently. Moreover, they set up authentication mechanisms to ensure that only authorized users could access the system, adding an essential layer of security to the project.
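A browser-side sketch of that capture path, assuming a WebSocket endpoint and an AudioWorklet module (both named hypothetically here) that posts raw Float32Array chunks from the audio thread:

```typescript
// Capture microphone audio in the browser and stream it to the backend.
// The /audio endpoint and the "pcm-capture" worklet are assumptions.

async function startStreaming(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  await ctx.audioWorklet.addModule("pcm-capture-processor.js"); // hypothetical module

  const source = ctx.createMediaStreamSource(stream);
  const capture = new AudioWorkletNode(ctx, "pcm-capture");
  source.connect(capture);

  const ws = new WebSocket("wss://example.com/audio"); // hypothetical endpoint
  ws.binaryType = "arraybuffer";

  // The worklet posts Float32Array chunks; forward them once the socket
  // is open (conversion to 16 kHz PCM can happen here or server-side).
  capture.port.onmessage = (event: MessageEvent<Float32Array>) => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(event.data.buffer);
    }
  };
}
```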
Results

In just three weeks, our team transformed the concept into reality, hitting key milestones along the way.
We implemented a real-time Speech-To-Text transcription system that could transcribe spoken words with an impressive delay of around 1 second. This quick turnaround ensured that users experienced a smooth and natural conversation with the avatar.
To enhance the avatar's responsiveness, we combined Claude 3 Haiku, a model optimized for speed and tasks such as instant customer support, with GPT-4's advanced reasoning capabilities. This pairing allowed the avatar to provide quick yet intelligent answers, making the interaction more engaging and informative.
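The case study does not spell out how the two models divide the work; one common pattern is to answer with the fast model by default and escalate harder questions to the stronger one. A sketch of that pattern using the official Anthropic and OpenAI SDKs, with the routing decision left to the caller:

```typescript
// Hypothetical routing sketch, not the project's actual integration:
// fast model by default, stronger model for questions flagged as hard.
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY
const openai = new OpenAI();       // reads OPENAI_API_KEY

async function answer(question: string, needsDeepReasoning: boolean): Promise<string> {
  if (!needsDeepReasoning) {
    // Claude 3 Haiku: lowest latency, suited to instant support-style replies.
    const msg = await anthropic.messages.create({
      model: "claude-3-haiku-20240307",
      max_tokens: 300,
      messages: [{ role: "user", content: question }],
    });
    const block = msg.content[0];
    return block.type === "text" ? block.text : "";
  }
  // GPT-4: slower but stronger reasoning for complex questions.
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: question }],
  });
  return completion.choices[0].message.content ?? "";
}
```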
Our team achieved real-time Text-To-Speech audio streaming with a minimal delay of approximately 0.5 seconds. This swift response time was crucial in maintaining the conversational flow, ensuring that users felt as though they were speaking with a live person.
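What keeps that delay near half a second is streaming: audio chunks are forwarded to the player as they arrive rather than after the full clip is synthesized. The sketch below uses ElevenLabs' streaming text-to-speech endpoint; the voice ID is a placeholder and the exact request shape should be checked against the current API reference.

```typescript
// Stream TTS audio as it is generated. Endpoint shape per ElevenLabs'
// documented streaming API at the time of writing; verify before use.
async function* streamSpeech(text: string, apiKey: string): AsyncIterable<Uint8Array> {
  const voiceId = "VOICE_ID"; // placeholder
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream?output_format=pcm_16000`,
    {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    },
  );
  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);

  // Yield each chunk to the player as soon as it arrives.
  const reader = res.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    yield value;
  }
}
```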
By introducing a "filler sentence" approach (described in the process section above), we effectively bridged the gap while the system processed user queries. This tactic brought the perceived latency down to around 1.5 seconds, making the avatar's responses feel more immediate.
What’s Next?
The client is pleased with the POC results, so we are not stopping here. The next challenges on our roadmap:
While effective, filler sentences can't be used constantly. We are exploring additional UX strategies to further reduce perceived latency.
We are working on adding conversation memory, knowledge grounding, and moderation capabilities to provide more contextually relevant and grounded responses.
Identifying a scalable and resilient audio streaming platform that ensures high audio quality remains our priority.
Sound interesting? Stay tuned: we will soon publish the next case study with the final project outcome!