call-i
My 2024 hackathon participation - won a MacBook
We won a hackathon in front of 3,000 people, and the CTO of SAP handed us our prizes!
This post is the more technical write-up of what we built.
Challenge
Our challenge title was "Customer Chatbots with LLMs" in the context of the SAP ecosystem.
Our Approach
In 2024, people are already used to excellent chatbots like ChatGPT. Therefore, we wanted to present something most people have never experienced before: a voice chatbot - an automated call center.
Why?
- Wow factor: A voice chatbot can be presented in a short demo and really impress the audience
- Cost Reduction: The jury for the first stage of the hackathon was a group of 30 CIOs, who are mainly concerned with keeping costs low. Human call centers are an immense cost factor for companies. If we could replace them with a voice chatbot, we could save a ton of money.
- Customer Experience: Voice chatbots are available 24/7 with no waiting time.
How I built it
Three steps are needed for a real human-AI voice interaction:
- Voice Detection: The user speaks, speech is detected and transcribed to text
- Answer Generation: The text is sent to an AI model, which generates an answer
- Speech Synthesis: The answer is converted to speech and played to the user
It is crucial to keep the latency between the user speaking and the AI answering as low as possible. The speech therefore has to be transcribed while the user is still talking and instantly streamed to the AI model; the model's response is streamed to the speech synthesis, whose audio is in turn streamed back to the user.
Streaming means sending data in small chunks instead of waiting for all of it to be available. The best example is ChatGPT, where you can see the text appear while the model generates it.
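As a toy illustration (stand-in functions only, not our actual services), the three stages can be chained as Python generators so that each stage consumes chunks as soon as the previous one produces them:

```python
import time
from typing import Iterator


def transcribe() -> Iterator[str]:
    """Stand-in for streaming speech-to-text: yields words as they are recognized."""
    for word in ["where", "is", "my", "order"]:
        time.sleep(0.1)  # simulate recognition delay
        yield word


def generate_answer(words: Iterator[str]) -> Iterator[str]:
    """Stand-in for the LLM: consumes the transcript as it streams in,
    then streams the answer out token by token."""
    question = " ".join(words)
    for token in f"Your order '{question}' shipped yesterday.".split():
        yield token + " "


def synthesize(tokens: Iterator[str]) -> Iterator[bytes]:
    """Stand-in for speech synthesis: turns each text chunk into an audio chunk."""
    for token in tokens:
        yield token.encode()  # pretend this is audio


# Each stage starts working as soon as the previous one emits a chunk.
for audio_chunk in synthesize(generate_answer(transcribe())):
    print(audio_chunk)
```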
Voice Detection
For voice detection, we used the browser's built-in speech recognition API. The user speaks into the browser, the speech is transcribed to text, and the transcript is streamed to the AI model. Ideally, you would use a more capable speech-to-text model like Whisper.
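For comparison, a non-streaming Whisper call via the OpenAI Python SDK might look like the sketch below (file name and setup are placeholders). Because Whisper needs the complete recording before the request can even start, this approach adds exactly the latency we were trying to avoid:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Whisper transcribes a finished audio file, so the user has to stop
# talking before transcription can begin - hence the latency problem.
with open("recording.wav", "rb") as audio_file:  # hypothetical recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```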
Answer Generation
Here, GPT-4, hosted on an Azure instance and connected to an SAP system, generates the answer and triggers actions like sending an email or calling an API.
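A minimal sketch of such a streaming call with one tool the model may trigger (endpoint, key, deployment name, and the send_email action are placeholders, not our exact setup):

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-instance.openai.azure.com",  # placeholder endpoint
    api_key="YOUR_AZURE_OPENAI_KEY",
    api_version="2024-02-01",
)

# One example tool the model may decide to call; the real system
# exposed actions like sending an email or calling an SAP API.
tools = [{
    "type": "function",
    "function": {
        "name": "send_email",  # hypothetical action
        "description": "Send a confirmation email to the customer",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "body"],
        },
    },
}]

stream = client.chat.completions.create(
    model="gpt-4",  # the Azure deployment name
    messages=[{"role": "user", "content": "Where is my order?"}],
    tools=tools,
    stream=True,  # tokens arrive as they are generated
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:  # text tokens go straight to speech synthesis
        print(delta.content, end="", flush=True)
    if delta.tool_calls:  # the model decided to trigger an action
        print(delta.tool_calls)
```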
Speech Synthesis
The GPT-4 answer is streamed to ElevenLabs, which converts the text to speech and streams the audio back to the user.
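A minimal sketch against the ElevenLabs streaming endpoint (voice ID, key, and model choice are placeholders); in the demo, the audio chunks were forwarded to the user instead of written to a file:

```python
import requests

VOICE_ID = "YOUR_VOICE_ID"  # placeholder voice

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
    headers={"xi-api-key": "YOUR_ELEVENLABS_KEY"},
    json={
        "text": "Your order shipped yesterday.",
        "model_id": "eleven_turbo_v2",  # low-latency model; an assumption
    },
    stream=True,  # receive audio chunks as they are synthesized
)
response.raise_for_status()

# Collect the audio chunks into a file; a real call-center setup would
# stream them onward to the caller as they arrive.
with open("answer.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)
```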
Tech Stack
- ElevenLabs: Speech Synthesis
- Azure OpenAI: Answer Generation, Function Calling
- Python
- Docker
- React
For the demo, voice detection was performed in the browser, since Whisper was not streamable and led to long latencies.