Best Practices
This guide provides proven strategies for building production-ready AI voice agents. Learn from real-world deployments and avoid common pitfalls.

Quick Reference
- Prompt Engineering
- Performance
- Testing
- Production
Be specific and structured. Define personality, constraints, and examples. Test edge cases.
Prompt Engineering
Your system prompt is the most critical factor in agent performance. A well-crafted prompt dramatically improves accuracy, consistency, and user satisfaction.

Structure Your Prompt
Use a clear, hierarchical structure for complex agents rather than a single unstructured paragraph.
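A sketch of the kind of sectioned prompt this refers to, written as a TypeScript template string; the agent name, company, and rules are illustrative placeholders rather than a required format (the $500 refund rule echoes the boundary example below):

```typescript
// Illustrative system prompt skeleton - adjust sections and wording to your domain.
export const systemPrompt = `
# Identity
You are Maya, a phone support agent for Acme Home Internet.

# Personality & Tone
Friendly, concise, and calm. One to two short sentences per turn.

# Task
Help customers check order status, troubleshoot outages, and book technician visits.

# Constraints
- Never quote prices that were not returned by a tool.
- Refunds over $500: transfer to the billing team.
- If unsure, say so and offer a human transfer.

# Example
Customer: "My internet keeps dropping."
You: "Sorry about that. Is the light on your router solid or blinking right now?"
`;
```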
Define Clear Boundaries
Explicitly state what the agent should and should not do:

Do This:
- “If the customer asks for a refund over $500, say: ‘I need to transfer you to our billing team who can help with that.’”
- “For technical issues, first confirm the customer has tried basic troubleshooting before escalating.”

Not This:
- “Handle customer issues appropriately.”
- “Escalate when necessary.”
Use Examples for Complex Behaviors
Include 2-3 concrete examples of desired conversations in the prompt.

Optimize for Voice
Voice conversations differ from text chat: what reads well on screen (bullet points, links, long paragraphs) often sounds unnatural when spoken aloud, so write for the ear.
Handle Edge Cases
Always define behavior for common edge cases, such as silence or no response.

Test with Real Scenarios
Use actual customer transcripts to test your prompt (a replay sketch follows this list):
- Collect 10-20 real conversations from your domain
- Run test calls with the same questions
- Compare responses to human agent responses
- Iterate on edge cases and failure patterns
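One way to script the replay step is to run each collected question against your prompt and log the answers for side-by-side review against the human agent's replies. A minimal sketch using the OpenAI Node SDK; the file names, file formats, and model are assumptions, and if your agent runs through BlackBox you would replay through the platform instead:

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// questions.json is assumed to hold an array of real customer utterances,
// e.g. ["Where is my order?", "I was charged twice", ...]
const questions: string[] = JSON.parse(readFileSync("questions.json", "utf8"));
const systemPrompt = readFileSync("system-prompt.txt", "utf8");

for (const question of questions) {
  const response = await client.chat.completions.create({
    model: "gpt-4.1-mini",
    temperature: 0.7,
    max_tokens: 250,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: question },
    ],
  });
  // Log question and answer so they can be compared with the human agent's response.
  console.log(`Q: ${question}\nA: ${response.choices[0].message.content}\n`);
}
```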
LLM Configuration
Choosing the right model and parameters dramatically affects performance, cost, and latency.

Model Selection
Choose based on your use case priority:
- Low Latency (Recommended)
- Complex Reasoning
- Cost Optimization
Low Latency (Recommended)
Best For: Real-time voice conversations, customer support
Models:
- gpt-4.1-mini (OpenAI) - Best balance of speed and quality
- llama-3.1-8b-instant (Groq) - Ultra-fast, cost-effective
- grok-3-mini (xAI) - Fast with good reasoning
Recommended Settings (combined into a config sketch after this list):
- Temperature: 0.6-0.8
- Max tokens: 150-300 (voice responses should be concise)
- Service tier: Auto (OpenAI)
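Pulled together, the low-latency profile might look like the object below. The field names are illustrative rather than an exact BlackBox schema; only the values mirror the recommendations above:

```typescript
// Illustrative low-latency LLM profile (field names are not a fixed schema).
const llmConfig = {
  provider: "openai",
  model: "gpt-4.1-mini",   // or "llama-3.1-8b-instant" (Groq) / "grok-3-mini" (xAI)
  temperature: 0.7,        // within the 0.6-0.8 range recommended for voice
  maxTokens: 250,          // keep spoken responses concise (150-300)
  serviceTier: "auto",     // see Service Tier Priority below for "priority"
};

export default llmConfig;
```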
Temperature Tuning
Temperature controls randomness in responses. Find the sweet spot for your use case:

| Temperature | Behavior | Best For |
|---|---|---|
| 0.0 - 0.3 | Deterministic, repetitive | Exact scripts, data lookup, strict protocols |
| 0.4 - 0.6 | Focused, consistent | Technical support, compliance-sensitive conversations |
| 0.7 - 0.8 | Balanced (Recommended) | General customer service, sales, most use cases |
| 0.9 - 1.2 | Creative, varied | Personality-driven bots, entertainment, brainstorming |
| 1.3 - 2.0 | Highly creative, unpredictable | Creative writing (not recommended for voice agents) |
To find the right value for your agent (a test-loop sketch follows this list):
- Use the same test conversation 5 times at each temperature (0.5, 0.7, 0.9)
- Measure:
  - Consistency (do responses vary appropriately?)
  - Accuracy (are facts correct?)
  - Tone (does personality match your brand?)
- Choose the lowest temperature that maintains natural conversation
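A small harness for that procedure, shown directly against the OpenAI Node SDK; the system prompt, test question, and model are placeholders, and if your agent runs through a platform you would replay the same conversation there instead:

```typescript
import OpenAI from "openai";

const client = new OpenAI();
const systemPrompt = "You are a concise phone support agent for an online store.";
const testQuestion = "Hi, I ordered a kettle last week and it still hasn't shipped.";

// Run the same turn several times at each candidate temperature, then review
// the outputs for consistency, accuracy, and tone.
for (const temperature of [0.5, 0.7, 0.9]) {
  for (let run = 1; run <= 5; run++) {
    const response = await client.chat.completions.create({
      model: "gpt-4.1-mini",
      temperature,
      max_tokens: 200,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: testQuestion },
      ],
    });
    console.log(`[temp=${temperature} run=${run}] ${response.choices[0].message.content}`);
  }
}
```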
Max Tokens Strategy
Set max tokens based on response length needs.

Voice Response Guidelines:
- Short answers (100-150 tokens): “Your order ships tomorrow” - Good for quick lookups
- Medium responses (200-300 tokens): Explanations with 2-3 key points - Most voice conversations
- Long responses (400-500 tokens): Detailed troubleshooting steps - Only when necessary
- Avoid 1000+ tokens: Voice users lose attention after 30-45 seconds
Service Tier Priority (OpenAI Only)
OpenAI’s priority tier reduces latency for real-time voice applications (a config sketch follows this list):
- Priority Tier (Toggle): When enabled, sets vendorSpecificOptions.service_tier = "priority" for lower latency and higher throughput
- Default (Priority Off): Standard latency, lower cost
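In configuration terms, the toggle corresponds to something like the snippet below. Only the vendorSpecificOptions.service_tier key comes from the description above; the surrounding object shape is an assumption:

```typescript
// Priority tier on: lower latency and higher throughput, at higher cost (OpenAI models only).
const openAiOptions = {
  vendorSpecificOptions: {
    service_tier: "priority", // omit (or leave the toggle off) for standard latency
  },
};

export default openAiOptions;
```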
Voice and Speech Optimization
Voice quality and natural speech patterns are critical for user experience.

TTS Provider Selection
Each provider has different strengths:
- ElevenLabs
- Cartesia
- Dasha
ElevenLabs
Best For: High-quality, natural-sounding voices
Strengths:
- Most natural prosody and emotion
- Excellent multilingual support
- Voice cloning capabilities
- Fine-grained emotion controls
Recommended Settings (sketched in code after this list):
- Speed: 0.9-1.1x (1.0 is natural)
- Stability: 0.4-0.6 (higher = more consistent, less expressive)
- Similarity Boost: 0.7-0.8
- Style: 0.2-0.4 (higher = more exaggerated)
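As a settings object, those ranges might look like this. The field names mirror commonly used ElevenLabs voice settings, but confirm them against your provider's current API, and treat the values as starting points to tune by ear:

```typescript
// Illustrative starting values - tune per voice and verify with real audio.
const voiceSettings = {
  speed: 1.0,              // 0.9-1.1 sounds natural for most voices
  stability: 0.5,          // higher = more consistent, less expressive
  similarity_boost: 0.75,  // how closely output tracks the reference voice
  style: 0.3,              // higher = more exaggerated delivery
};

export default voiceSettings;
```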
Voice Speed Guidelines
Adjust speed based on content complexity and audience:

| Speed | Use Case | Example |
|---|---|---|
| 0.8x - 0.9x | Complex information, elderly users | Technical support, healthcare |
| 1.0x | Standard (Recommended) | Most conversations |
| 1.1x - 1.2x | Simple information, younger users | Order confirmations, quick updates |
| 1.3x+ | Very simple, repetitive content | Automated announcements |
Speech Recognition Best Practices
While ASR is auto-selected by BlackBox, you can optimize for it.

In Your Prompts (an example prompt fragment follows this list):
- Ask confirmation questions: “Did you say ECHO-1234?”
- Spell out ambiguous information: “That’s E as in Echo, C as in Charlie…”
- Use verbal checksums: “Your confirmation code is 1-2-3-4. That’s one, two, three, four.”
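As a prompt fragment, those instructions might read like the sketch below; the wording is illustrative, not a required format:

```typescript
// Illustrative prompt fragment for ASR robustness - adapt the wording to your agent.
export const asrInstructions = `
When the caller gives an order number, code, or email address:
- Read it back and ask for confirmation, e.g. "Did you say ECHO-1234?"
- Spell ambiguous parts with the phonetic alphabet: "E as in Echo, C as in Charlie".
- Repeat digits one at a time: "Your confirmation code is 1-2-3-4. That's one, two, three, four."
If you are not confident about what the caller said, ask them to repeat it.
`;
```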
Conversation Design
Design conversations that feel natural and accomplish goals efficiently.

Conversation Flow Patterns
Use these proven patterns (the first is sketched as a simple state machine after this list):
- Greeting → Intent → Action → Close
- Verification → Confirmation → Execution
- Discovery → Qualification → Next Steps
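One way to make the first pattern explicit is a small stage type that your conversation handler advances through. This is a generic sketch, not a BlackBox API:

```typescript
// Generic sketch of the Greeting → Intent → Action → Close pattern.
type Stage = "greeting" | "intent" | "action" | "close";

function nextStage(
  stage: Stage,
  done: { intentCaptured: boolean; actionFinished: boolean },
): Stage {
  switch (stage) {
    case "greeting":
      return "intent";                                  // greet once, then find out why they called
    case "intent":
      return done.intentCaptured ? "action" : "intent"; // stay here until the goal is clear
    case "action":
      return done.actionFinished ? "close" : "action";  // look up data, call tools, answer
    case "close":
      return "close";                                   // confirm resolution and end the call
  }
}
```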
Turn-Taking and Interruptions
Design for natural conversation flow and allow natural interruptions: when the caller cuts in, the agent should stop speaking and listen rather than finish its sentence.

Error Recovery
Plan for conversation breakdowns. For misunderstanding recovery, define how the agent rephrases, confirms what it heard, and escalates to a human if the confusion persists.

Testing Strategies
Systematic testing prevents production failures and poor user experiences.

Pre-Launch Testing Checklist
Test these scenarios before deploying:

Happy Path (5-10 tests):
- Simple request with immediate answer
- Multi-step conversation (3+ turns)
- Tool/function call succeeds
- Transfer to human works
- End conversation naturally
Edge Cases:
- Silence for 10+ seconds
- Customer interrupts mid-sentence
- Background noise (music, traffic, talking)
- Customer speaks very fast or slow
- Repeated misrecognition of same word
- Request for something out of scope
- Customer is angry or frustrated
- Multiple requests in one turn
- Customer changes mind mid-conversation
Failure Scenarios:
- Tool call times out
- Tool returns error
- Invalid data format
- Webhook doesn’t respond
- Network interruption
- Customer hangs up mid-turn
Load Testing
For high-volume deployments, test concurrency (a minimal harness is sketched after this list):
- Start small: 5-10 concurrent calls
- Measure: Latency, error rate, call quality
- Increase gradually: Double concurrency each round
- Monitor: Watch for degradation patterns
- Set limits: Configure concurrency caps based on results
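A minimal shape for such a harness is sketched below. placeTestCall is a stub for however you actually start a call (SDK, SIP, or HTTP API), so treat this as a skeleton rather than a working client:

```typescript
// Skeleton load test: replace placeTestCall with your real call-placement logic.
async function placeTestCall(): Promise<void> {
  // e.g. dial a test number via your telephony client and wait for hangup
  await new Promise((resolve) => setTimeout(resolve, 1000));
}

async function runRound(concurrency: number) {
  const started = Date.now();
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, () => placeTestCall()),
  );
  const failures = results.filter((r) => r.status === "rejected").length;
  console.log(
    `concurrency=${concurrency} errors=${failures}/${concurrency} wallClockMs=${Date.now() - started}`,
  );
}

// Start small and double each round, watching latency and error rate for degradation.
for (const concurrency of [5, 10, 20, 40]) {
  await runRound(concurrency);
}
```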
A/B Testing
Compare agent versions systematically.

Version A vs B:
- Same agent, different prompts
- Same prompt, different temperatures
- Same config, different voices
Metrics to Compare (an aggregation sketch follows this list):
- Call success rate (user achieved goal)
- Average call duration
- Tool call accuracy
- User satisfaction (if post-call analysis enabled)
- Transfer rate (lower is often better)
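To compare variants from call logs, aggregate per-variant metrics along these lines; the CallRecord shape is an assumption about what your call logs or post-call analysis export contains:

```typescript
// Assumed call-log shape - adjust to whatever your export actually contains.
interface CallRecord {
  variant: "A" | "B";
  success: boolean;      // did the user achieve their goal?
  durationSec: number;
  transferred: boolean;  // escalated to a human
}

function summarize(calls: CallRecord[], variant: "A" | "B") {
  const subset = calls.filter((c) => c.variant === variant);
  const rate = (pred: (c: CallRecord) => boolean) =>
    subset.length ? subset.filter(pred).length / subset.length : 0;
  return {
    calls: subset.length,
    successRate: rate((c) => c.success),
    transferRate: rate((c) => c.transferred),
    avgDurationSec:
      subset.reduce((sum, c) => sum + c.durationSec, 0) / Math.max(subset.length, 1),
  };
}

// Usage: console.table([summarize(calls, "A"), summarize(calls, "B")]);
```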
Performance Optimization
Optimize for latency, cost, and quality based on your priorities.

Latency Reduction
Voice agents are latency-sensitive. Reduce delays:

Choose Fast Components:
- LLM: gpt-4.1-mini, llama-3.1-8b-instant, grok-3-mini
- TTS: Cartesia (fastest), Dasha (fast), ElevenLabs (slower but high quality)
- ASR: Auto-selection handles this
Streamline the Prompt:
- Shorter prompts = faster processing
- Remove redundant instructions
- Use tools instead of long context
Keep Responses Short:
- Max tokens: 150-300 for voice
- Concise system prompts
- Discourage verbose responses
Cost Optimization
Reduce costs without sacrificing quality:

Model Selection:
- deepseek-r1 - 30x cheaper than GPT-4, similar quality
- llama-3.1-8b-instant - Very low cost per token
- gpt-4.1-nano - OpenAI’s most cost-effective
Prompt Optimization:
- Shorter system prompts (remove examples if not needed)
- Lower max tokens (100-200 for simple agents)
- Use tools for data lookup (don’t put data in prompt)
Caching and Reuse:
- Reuse common prompt segments
- Cache knowledge base embeddings
- Minimize unique per-call prompt variations
Quality Monitoring
Track these metrics in production:

Per-Call Metrics:
- Success rate (did user achieve goal?)
- Call duration (outliers indicate issues)
- Tool call accuracy
- Number of clarification requests
Aggregate Metrics:
- Daily/weekly call volume trends
- Error rate by error type
- User satisfaction scores (via post-call analysis)
- Transfer rate (escalations to human)
Set Alerts For (a threshold-check sketch follows this list):
- Error rate > 5% in 1 hour
- Average call duration > 2x baseline
- Success rate < 70%
- Concurrency limit reached
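If your monitoring stack lets you script checks, the thresholds above translate roughly into the function below; the HourlyMetrics shape is an assumption about what you collect:

```typescript
// Assumed hourly metrics snapshot - wire this to whatever you actually collect.
interface HourlyMetrics {
  errorRate: number;           // 0.05 = 5%
  avgCallDurationSec: number;
  baselineDurationSec: number;
  successRate: number;         // 0.70 = 70%
  concurrentCalls: number;
  concurrencyLimit: number;
}

function alertsFor(m: HourlyMetrics): string[] {
  const alerts: string[] = [];
  if (m.errorRate > 0.05) alerts.push("Error rate above 5% in the last hour");
  if (m.avgCallDurationSec > 2 * m.baselineDurationSec) alerts.push("Average call duration above 2x baseline");
  if (m.successRate < 0.7) alerts.push("Success rate below 70%");
  if (m.concurrentCalls >= m.concurrencyLimit) alerts.push("Concurrency limit reached");
  return alerts;
}
```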
Common Anti-Patterns
Avoid these mistakes that lead to poor user experiences:

Overly Complex Prompts
Anti-Pattern: Cramming every policy, exception, and FAQ answer into one sprawling system prompt
Best Practice: Keep the prompt focused on the agent’s core job and move data lookups into tools
Ignoring Voice-Specific Design
Anti-Pattern: Designing for text chat and expecting it to work for voice
Example: Using bullet points, tables, URLs, “click here” instructions
Best Practice:
- Verbal lists: “I have three options for you. First, second, third.”
- No visual references: “I’ll send you a link” not “Click the button below”
- Spell out important codes: “Your code is A-B-C-1-2-3. That’s Alpha Bravo Charlie one two three.”
Not Testing Edge Cases
Anti-Pattern: Only testing happy path scenarios
Result: Agents that fail when customers deviate from expected behavior
Best Practice:
- Test with real background noise
- Test with fast/slow speakers
- Test interruptions and silence
- Test unclear requests
Over-Engineering on Day 1
Anti-Pattern: Building a perfect agent with every feature before testing
Result: Months of development before user feedback, misaligned features
Best Practice:
- Start with minimum viable agent (basic prompt + 1-2 tools)
- Deploy to limited beta users
- Iterate based on real conversation data
- Add complexity only when needed
Ignoring Metrics
Anti-Pattern: “Set it and forget it” - no monitoring after deployment
Result: Degraded performance goes unnoticed, user satisfaction drops
Best Practice:
- Daily review of key metrics (success rate, errors)
- Weekly review of conversation samples
- Monthly prompt optimization based on patterns
- Set up automated alerts for anomalies
Production Deployment Checklist
Use this checklist before going live:

Pre-Launch:
- Test with 10+ users outside your team
- Review 50+ test conversation transcripts
- Set up monitoring dashboards
- Configure error alerts
- Monitor concurrency limits (contact support if you need a higher cap)
- Test failure scenarios (timeouts, errors)
- Verify webhook endpoints are live
- Test call transfers work
- Confirm business hours are correct
- Set up post-call analysis (optional)
Launch Day:
- Start with small percentage of traffic (10-20%)
- Monitor metrics every hour
- Have human fallback ready
- Quick prompt iteration capability
- Support team briefed on escalation
First Week:
- Daily metric reviews
- Sample conversation reviews
- Collect user feedback
- Adjust prompt based on findings
- Gradually increase traffic
Ongoing:
- Weekly performance reviews
- Monthly prompt optimization
- Quarterly voice/model updates
- Regular A/B testing
Next Steps
- Configuration: LLM Configuration, Voice & Speech
- Testing: Testing Overview, Dashboard Testing
- Advanced: Advanced Features, Post-Call Analysis
- Deployment: Production Checklist