Build a production voice support agent with ElevenLabs and RAG

Problem

Building a voice agent that can answer customer support questions with natural-sounding speech requires stitching together speech-to-text, an LLM, retrieval-augmented generation, and text-to-speech. Each component has latency, accuracy, and cost tradeoffs. Doing this from scratch takes weeks of audio pipeline engineering, and the result often sounds robotic or hallucinates answers.

Solution

Step 1: Set up ElevenLabs Conversational AI 2.0

Create a new agent on the ElevenLabs platform with a natural voice:

1. Go to elevenlabs.io/app/conversational-ai
2. Create a new agent
3. Select a voice (e.g. one with an Australian accent)
4. Choose Gemini 2.5 Flash as the reasoning model

Step 2: Write the base system prompt

Use Claude to draft a thorough system prompt from your marketing site content:

# System prompt for the voice agent
You are a friendly support agent for [Company Name].
Answer customer questions based on the provided knowledge base.
Keep responses conversational and concise (under 3 sentences).
If you do not know the answer, say so and offer to connect
the customer with a human agent.
Do not make up product features or pricing.
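
One way to keep the base prompt reusable is to store it as a template and check test replies against its length rule. The sketch below is illustrative only: `render_system_prompt`, `count_sentences`, and `is_concise` are hypothetical helper names, not part of any ElevenLabs tooling, and the sentence splitter is a naive regex that will miscount abbreviations such as "e.g.".

```python
import re
from string import Template

# Base prompt from Step 2, with the company name as a template field.
BASE_PROMPT = Template(
    "You are a friendly support agent for $company_name.\n"
    "Answer customer questions based on the provided knowledge base.\n"
    "Keep responses conversational and concise (under 3 sentences).\n"
    "If you do not know the answer, say so and offer to connect\n"
    "the customer with a human agent.\n"
    "Do not make up product features or pricing."
)


def render_system_prompt(company_name: str) -> str:
    """Fill in the company name, rejecting an empty value."""
    name = company_name.strip()
    if not name:
        raise ValueError("company_name must not be empty")
    return BASE_PROMPT.substitute(company_name=name)


def count_sentences(reply: str) -> int:
    """Naive count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", reply.strip())
    return sum(1 for part in parts if part)


def is_concise(reply: str, limit: int = 3) -> bool:
    """True when the reply has fewer than `limit` sentences."""
    return count_sentences(reply) < limit
```

Paste the output of `render_system_prompt("Your Company")` into the agent's system prompt field, and run `is_concise` over replies copied from test conversations to spot answers that run long.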

Step 3: Configure RAG with your knowledge base

Upload your support documentation as the agent's context:

1. Export your help docs (Zendesk, Notion, etc.) as text files
2. Upload to the ElevenLabs knowledge base
3. Add your marketing site URL as a context source
4. Enable RAG in the agent settings
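
The export step can be sketched as a small script that turns help articles into plain text files ready for upload. This assumes each exported article is a dict with a `title` and an HTML `body`, which is an assumption about your export format; the helper names are hypothetical, and the upload itself still happens in the ElevenLabs dashboard.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Closing one of these tags ends a paragraph, list item, or heading.
BLOCK_TAGS = {"p", "li", "div", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text content and inserts line breaks at block boundaries."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.parts).splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


def slugify(title: str) -> str:
    """Turn an article title into a safe file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def write_articles(articles: list[dict], out_dir: str) -> list[Path]:
    """Write one .txt file per article (assumed keys: 'title', 'body')."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for article in articles:
        text = f"{article['title']}\n\n{html_to_text(article['body'])}\n"
        path = out / f"{slugify(article['title'])}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Keeping one article per file makes it easier to update or remove a single document from the knowledge base later.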

Step 4: Test and deploy

# ElevenLabs provides a shareable test link:
https://elevenlabs.io/app/talk-to?agent_id=agent_YOUR_ID

# Embed in your app with the HTML widget snippet:

<script src="https://elevenlabs.io/convai-widget/index.js" async></script>
<elevenlabs-convai agent-id="agent_YOUR_ID"></elevenlabs-convai>

Why It Works

ElevenLabs Conversational AI 2.0 handles the entire audio pipeline: speech-to-text, LLM reasoning with RAG, and text-to-speech with natural voice synthesis. Gemini 2.5 Flash responds quickly, which matters in voice conversations because long pauses feel unnatural. RAG grounds the agent's responses in your actual documentation, which reduces hallucinations. Built-in interruption detection keeps turn-taking natural. Because the platform supplies all of these pieces, you can ship in hours instead of weeks.
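
To make the grounding idea concrete, here is a toy retrieval step: score each documentation chunk against the question, keep the best matches, and hand only those to the model as context. This is not how ElevenLabs implements RAG (hosted systems typically use embedding search); it is a minimal bag-of-words sketch with hypothetical function names.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks that share vocabulary with the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_k] if score > 0]


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Attach retrieved context so the model answers from the docs."""
    context = retrieve(question, chunks)
    if not context:
        return f"Context: (nothing relevant found)\n\nQuestion: {question}"
    bullet_list = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{bullet_list}\n\nQuestion: {question}"
```

When nothing relevant is retrieved, the prompt says so explicitly, which pairs with the system prompt's instruction to admit uncertainty and offer a human agent.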

Context

ElevenLabs has a free tier with limited minutes for testing
Gemini 2.5 Flash was chosen for speed; voice agents need sub-second LLM responses to feel natural
Export help docs from Zendesk using Airbyte or simpler export tools, then upload as text files
Sesame (sesame.com) is an alternative for ultra-realistic voice but with fewer RAG integrations
Play.ht is another option but has been reported as lower quality for conversational use cases
Cost scales per minute of conversation; monitor usage in production
Test with different accents and background noise levels before deploying to customers
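
Since cost scales per minute, a rough projection helps before launch. The sketch below is plain arithmetic with hypothetical function names; the per-minute rate is a parameter you must fill in from your own plan, and the example figures are made up, not ElevenLabs pricing.

```python
def estimate_monthly_cost(
    conversations_per_day: int,
    avg_minutes: float,
    price_per_minute: float,
    days: int = 30,
) -> float:
    """Projected spend = conversations/day x minutes each x rate x days."""
    if min(conversations_per_day, avg_minutes, price_per_minute, days) < 0:
        raise ValueError("inputs must be non-negative")
    return conversations_per_day * avg_minutes * price_per_minute * days


def over_budget(
    minutes_used: float, price_per_minute: float, monthly_budget: float
) -> bool:
    """True when spend so far this month exceeds the budget."""
    return minutes_used * price_per_minute > monthly_budget


# Example with made-up numbers: 100 calls/day, 3 minutes each,
# at a hypothetical rate of 0.10 per minute.
projected = estimate_monthly_cost(100, 3.0, 0.10)
```

Feeding `over_budget` with the minutes reported in your usage dashboard gives a simple alert condition for production monitoring.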