GenAI voicebots on AWS: five lessons from production
After deploying voicebots across healthcare, finance, and telecom, we distilled the five lessons that cost us the most time and saved our clients the most pain.
In the first part of this series, we covered how voicebots work and how to deploy them on AWS. Now, the lessons themselves.
1. Choose the right services — it’s all about tradeoffs
There’s no one-size-fits-all stack. Every voicebot requires balancing four factors that often conflict with each other: quality, speed, cost, and ease of use. Here are concrete comparisons we’ve evaluated:
Text-to-Speech: Amazon Polly vs ElevenLabs. Amazon Polly Neural provides good quality with easy AWS integration at a reasonable cost. ElevenLabs offers superior, more human-sounding voices — but at a higher cost and with an external dependency. For a healthcare appointment bot, Amazon Polly may be sufficient. For a luxury brand’s customer service, ElevenLabs might justify the investment.
Conversational AI: Amazon Lex vs Amazon Bedrock. Amazon Lex is easier to implement and faster to market, better for structured conversations. Amazon Bedrock provides superior conversational quality for complex, dynamic interactions but requires more development effort.
Speech-to-Text: Amazon Transcribe vs self-hosted Whisper. Transcribe is managed, with good accuracy and pay-per-minute pricing. Self-hosted Whisper has no per-minute fee, but you run and scale the inference infrastructure yourself. At high call volumes, self-hosting can save significant money.
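The Transcribe-vs-Whisper decision comes down to a break-even calculation: fixed monthly infrastructure cost versus pay-per-minute pricing. A minimal sketch, where both prices are illustrative placeholders rather than real vendor quotes:

```python
# Break-even between managed STT (pay-per-minute) and self-hosted
# Whisper (fixed monthly infra cost). Prices are placeholders.

def breakeven_minutes(managed_per_min: float, monthly_infra: float) -> float:
    """Monthly call minutes above which self-hosting becomes cheaper."""
    return monthly_infra / managed_per_min

# Example: $0.024/min managed vs a $600/month GPU instance.
minutes = breakeven_minutes(0.024, 600.0)
print(f"Self-hosting wins above ~{minutes:,.0f} minutes/month")  # ~25,000
```

Plug in your negotiated rates and the actual cost of the instances (plus the engineering time to operate them) before deciding.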
Every voicebot ends up with a different combination. Always analyze your specific requirements.
2. Prepare for throttling — it will happen
This is one of the most painful lessons we’ve learned. Throttling will kill your voicebot experience if you don’t plan for it. When you hit rate limits, the voicebot stutters, freezes, or fails mid-conversation.

The bottlenecks are everywhere:

- Foundation models have tokens-per-minute quotas.
- AWS Fargate has vCPU limits per region (with a default of 100 vCPUs and 2 vCPUs per task, that's only 50 concurrent conversations).
- Amazon Polly has character and request rate limits.
- Third-party services have their own rate limits you don't control.
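The capacity math behind these ceilings is simple but worth writing down per bottleneck. A sketch using the Fargate numbers above; the token figures are illustrative placeholders, not real Amazon Bedrock quotas:

```python
# Each quota implies a hard cap on concurrent conversations.

def max_concurrent_calls(vcpu_quota: int, vcpus_per_task: int) -> int:
    """Concurrent conversations allowed by a regional Fargate vCPU quota."""
    return vcpu_quota // vcpus_per_task

def max_calls_by_tokens(tpm_quota: int, tokens_per_call_per_min: int) -> int:
    """Concurrent conversations allowed by a model's tokens-per-minute quota."""
    return tpm_quota // tokens_per_call_per_min

print(max_concurrent_calls(100, 2))         # 50, matching the example above
print(max_calls_by_tokens(200_000, 4_000))  # hypothetical token ceiling
```

Your real capacity is the minimum across all such ceilings, which is why a single forgotten quota can cap the whole system.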
Our mitigation strategies:

- Load test before launch. We discovered that at 50 concurrent calls we'd hit Amazon Bedrock token limits. Knowing this ahead of time let us fix it before customers were affected.
- Request quota increases proactively. Don’t wait until you hit limits. These can take days to approve. Set alerts at 70–80% capacity.
- Build failover mechanisms. Use multiple AWS accounts to distribute load. Use multiple models — if one throttles, fall back to another. If ElevenLabs throttles, switch to Amazon Polly.
- Have graceful degradation paths. Voice is pure real-time — you can’t pause for 30 seconds. When retry isn’t possible, end the call gracefully and offer a callback.
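The fallback chain described above can be sketched provider-agnostically. Here `ThrottledError` and the provider names are illustrative stand-ins; in practice each entry would wrap a real client (Amazon Bedrock, ElevenLabs, Amazon Polly), and the final failure path would trigger the graceful call-ending flow:

```python
# Provider fallback: try each in priority order, falling back only on
# throttling. ThrottledError and the providers are illustrative stubs.

class ThrottledError(Exception):
    """Raised by a provider wrapper when it hits a rate limit."""

def call_with_fallback(providers, payload):
    """Try providers in order; return (name, result) from the first success."""
    last_error = None
    for name, fn in providers:
        try:
            return name, fn(payload)
        except ThrottledError as e:
            last_error = e  # this provider is saturated; try the next one
    # Every provider throttled: end the call gracefully, offer a callback.
    raise RuntimeError("all providers throttled") from last_error

# Usage: the primary TTS throttles, the fallback answers.
def flaky_tts(text):
    raise ThrottledError("rate limit exceeded")

def backup_tts(text):
    return f"<audio for: {text}>"

provider, audio = call_with_fallback(
    [("elevenlabs", flaky_tts), ("polly", backup_tts)], "Hello!"
)
print(provider)  # polly
```

Note that only throttling triggers the fallback; other errors should surface immediately so you don't mask real bugs behind a model swap.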
Plan for 10x your expected volume, not 2x.
3. Estimate costs realistically
Voicebots can get expensive at scale if you’re not careful. Here’s where your money goes:

The total cost ranges from approximately $0.05 to $0.50 per minute, and those per-minute differences multiply into huge differences at scale. LLM selection alone can swing costs 5–10x, and TTS providers range from the cheapest options to 10x more expensive.
Build a cost model before production. Monitor costs daily in the first weeks. Set budgets and alerts in AWS Cost Management. Understand your economics before you scale.
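A cost model doesn't need to be elaborate to be useful. A back-of-the-envelope sketch; every unit price here is a placeholder to be replaced with your actual rates:

```python
# Per-minute cost model for a voicebot. All prices are placeholders.

COSTS_PER_MIN = {
    "telephony": 0.010,
    "stt":       0.024,
    "llm":       0.060,  # can vary 5-10x with model choice
    "tts":       0.016,  # can vary up to 10x by provider
}

def per_minute_cost() -> float:
    return sum(COSTS_PER_MIN.values())

def monthly_cost(minutes: float) -> float:
    return minutes * per_minute_cost()

print(f"${per_minute_cost():.3f}/min -> "
      f"${monthly_cost(100_000):,.0f} for 100k minutes/month")
```

Re-running this with each candidate model and TTS provider makes the 5–10x swings concrete before you commit to an architecture.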
4. Automate your testing
Testing voicebots is uniquely challenging. You can’t just run unit tests — you need to test actual conversations. Manual testing doesn’t scale: imagine testing 50 scenarios after every code change. We built an automated testing approach using what we call EvalBot.

Here’s how it works: a tester defines a scenario with a persona and goal (for example, “a customer who politely declines a marketing offer”). EvalBot — an AI simulating a human caller — calls your production voicebot through the actual telecom infrastructure and responds naturally according to its persona. The entire conversation (transcript and audio) is recorded for evaluation.
We then evaluate goal achievement (did the bot handle the scenario correctly?), conversation quality (were responses appropriate?), and technical metrics (latency, errors, token usage, cost per conversation).
The real power: you can spin up 100 EvalBot instances simultaneously to simulate call spikes. This is how we discovered throttling issues before production. We tested 40 scenarios in 15 minutes versus hours of manual testing and caught regressions immediately when changing models.
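The scenario-plus-fan-out structure above can be sketched in a few lines. `run_evalbot` here is a hypothetical stand-in for the component that places a real call over telephony and records the conversation:

```python
# EvalBot-style fan-out: define persona/goal scenarios, run them in
# parallel to simulate a call spike, then score the results.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Scenario:
    persona: str
    goal: str

def run_evalbot(scenario: Scenario) -> dict:
    # Placeholder: in production this dials the voicebot through real
    # telecom infrastructure and returns transcript, audio, and metrics.
    return {"goal": scenario.goal, "goal_achieved": True, "latency_ms": 850}

scenarios = [
    Scenario("polite customer", "declines a marketing offer"),
    Scenario("impatient caller", "reschedules an appointment"),
] * 50  # 100 simultaneous simulated callers

with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(run_evalbot, scenarios))

pass_rate = sum(r["goal_achieved"] for r in results) / len(results)
print(f"{pass_rate:.0%} of {len(results)} simulated calls met their goal")
```

The same harness doubles as a load test: the 100 concurrent callers are exactly how throttling issues surface before real customers hit them.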
Important caveat: each test costs money (same as a production call), and EvalBot isn’t a perfect human. You still need human testing for voice quality, emotional tone, and accessibility.
5. Listen to your clients
Four insights we’ve gathered directly from production users:
- Latency is key. Users will forgive a slightly less accurate response, but they won’t tolerate long pauses. Response time makes or breaks the experience.
- Realism matters. Robotic or unnatural voices create immediate distrust. Users are far more forgiving of mistakes if the voice sounds natural. Investing in high-quality TTS pays off in user acceptance.
- Deploy ASAP. Get your solution in front of real users quickly. The insights from even 20 real calls are more valuable than 200 simulated scenarios. Real conversations uncover edge cases that no amount of internal testing will find.
- Huge potential. For cold calls focused on obtaining marketing consent, our voicebots achieved results comparable to human agents. In some cases, customers were actually more comfortable declining to a bot than to a human. This shows voicebots have significant potential even in scenarios we traditionally thought required a human touch.
Wrapping up
Voicebots are a natural evolution of chatbots — add STT and TTS to your existing conversational AI, and you unlock entirely new channels. AWS provides all the building blocks, from managed solutions (Amazon Connect + Amazon Lex) to fully custom architectures (AWS Fargate + Pipecat + LangGraph + Amazon Bedrock).
But success requires more than choosing the right services. It requires thoughtful architecture, realistic cost planning, proactive throttling management, automated testing, and — above all — getting in front of real users early.
If your business involves phone calls, the question isn’t whether voicebots can help — it’s whether you’re building them the right way for your specific needs.
At Chaos Gears, we offer both custom development (we build it with your team, you own it) and managed service (we handle the infrastructure, you focus on conversation design). If you’re exploring voicebots, let’s talk.
