Best Practices
This guide provides proven strategies for building production-ready AI voice agents. Learn from real-world deployments and avoid common pitfalls.

Quick Reference
- Prompt Engineering
- Performance
- Testing
- Production
Be specific and structured. Define personality, constraints, and examples. Test edge cases.
Prompt Engineering
Your system prompt is the most critical factor in agent performance. A well-crafted prompt dramatically improves accuracy, consistency, and user satisfaction.

Structure Your Prompt
Use a clear, hierarchical structure for complex agents rather than a single unstructured paragraph.
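A sketch of the kind of sectioned prompt this refers to, written as a TypeScript template string; the agent name, company, and rules are illustrative placeholders rather than a required format (the $500 refund rule echoes the boundary example below):

```typescript
// Illustrative system prompt skeleton - adjust sections and wording to your domain.
export const systemPrompt = `
# Identity
You are Maya, a phone support agent for Acme Home Internet.

# Personality & Tone
Friendly, concise, and calm. One to two short sentences per turn.

# Task
Help customers check order status, troubleshoot outages, and book technician visits.

# Constraints
- Never quote prices that were not returned by a tool.
- Refunds over $500: transfer to the billing team.
- If unsure, say so and offer a human transfer.

# Example
Customer: "My internet keeps dropping."
You: "Sorry about that. Is the light on your router solid or blinking right now?"
`;
```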
Define Clear Boundaries
Explicitly state what the agent should and should not do:

Do This:
- “If the customer asks for a refund over $500, say: ‘I need to transfer you to our billing team who can help with that.’”
- “For technical issues, first confirm the customer has tried basic troubleshooting before escalating.”

Not This:
- “Handle customer issues appropriately.”
- “Escalate when necessary.”
Use Examples for Complex Behaviors
Include 2-3 concrete examples of desired conversations in the prompt.

Optimize for Voice
Voice conversations differ from text chat: what reads well on screen (bullet points, links, long paragraphs) often sounds unnatural when spoken aloud, so write for the ear.
Handle Edge Cases
Always define behavior for common edge cases, such as silence or no response.

Test with Real Scenarios
Use actual customer transcripts to test your prompt (a replay sketch follows this list):
- Collect 10-20 real conversations from your domain
- Run test calls with the same questions
- Compare responses to human agent responses
- Iterate on edge cases and failure patterns
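One way to script the replay step is to run each collected question against your prompt and log the answers for side-by-side review against the human agent's replies. A minimal sketch using the OpenAI Node SDK; the file names, file formats, and model are assumptions, and if your agent runs through BlackBox you would replay through the platform instead:

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// questions.json is assumed to hold an array of real customer utterances,
// e.g. ["Where is my order?", "I was charged twice", ...]
const questions: string[] = JSON.parse(readFileSync("questions.json", "utf8"));
const systemPrompt = readFileSync("system-prompt.txt", "utf8");

for (const question of questions) {
  const response = await client.chat.completions.create({
    model: "gpt-4.1-mini",
    temperature: 0.7,
    max_tokens: 250,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: question },
    ],
  });
  // Log question and answer so they can be compared with the human agent's response.
  console.log(`Q: ${question}\nA: ${response.choices[0].message.content}\n`);
}
```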
LLM Configuration
Choosing the right model and parameters dramatically affects performance, cost, and latency.

Model Selection
Choose based on your use case priority:
- Low Latency (Recommended)
- Complex Reasoning
- Cost Optimization
Low Latency (Recommended)
Best For: Real-time voice conversations, customer support
Models:
- gpt-4.1-mini (OpenAI) - Best balance of speed and quality
- llama-3.1-8b-instant (Groq) - Ultra-fast, cost-effective
- grok-3-mini (xAI) - Fast with good reasoning
Recommended Settings (combined into a config sketch after this list):
- Temperature: 0.6-0.8
- Max tokens: 150-300 (voice responses should be concise)
- Service tier: Auto (OpenAI)
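Pulled together, the low-latency profile might look like the object below. The field names are illustrative rather than an exact BlackBox schema; only the values mirror the recommendations above:

```typescript
// Illustrative low-latency LLM profile (field names are not a fixed schema).
const llmConfig = {
  provider: "openai",
  model: "gpt-4.1-mini",   // or "llama-3.1-8b-instant" (Groq) / "grok-3-mini" (xAI)
  temperature: 0.7,        // within the 0.6-0.8 range recommended for voice
  maxTokens: 250,          // keep spoken responses concise (150-300)
  serviceTier: "auto",     // see Service Tier Priority below for "priority"
};

export default llmConfig;
```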
Temperature Tuning
Temperature controls randomness in responses. Find the sweet spot for your use case:

| Temperature | Behavior | Best For |
|---|---|---|
| 0.0 - 0.3 | Deterministic, repetitive | Exact scripts, data lookup, strict protocols |
| 0.4 - 0.6 | Focused, consistent | Technical support, compliance-sensitive conversations |
| 0.7 - 0.8 | Balanced (Recommended) | General customer service, sales, most use cases |
| 0.9 - 1.2 | Creative, varied | Personality-driven bots, entertainment, brainstorming |
| 1.3 - 2.0 | Highly creative, unpredictable | Creative writing (not recommended for voice agents) |
To find the right value for your agent (a test-loop sketch follows this list):
- Use the same test conversation 5 times at each temperature (0.5, 0.7, 0.9)
- Measure:
  - Consistency (do responses vary appropriately?)
  - Accuracy (are facts correct?)
  - Tone (does personality match your brand?)
- Choose the lowest temperature that maintains natural conversation
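A small harness for that procedure, shown directly against the OpenAI Node SDK; the system prompt, test question, and model are placeholders, and if your agent runs through a platform you would replay the same conversation there instead:

```typescript
import OpenAI from "openai";

const client = new OpenAI();
const systemPrompt = "You are a concise phone support agent for an online store.";
const testQuestion = "Hi, I ordered a kettle last week and it still hasn't shipped.";

// Run the same turn several times at each candidate temperature, then review
// the outputs for consistency, accuracy, and tone.
for (const temperature of [0.5, 0.7, 0.9]) {
  for (let run = 1; run <= 5; run++) {
    const response = await client.chat.completions.create({
      model: "gpt-4.1-mini",
      temperature,
      max_tokens: 200,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: testQuestion },
      ],
    });
    console.log(`[temp=${temperature} run=${run}] ${response.choices[0].message.content}`);
  }
}
```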
Max Tokens Strategy
Set max tokens based on response length needs.

Voice Response Guidelines:
- Short answers (100-150 tokens): “Your order ships tomorrow” - Good for quick lookups
- Medium responses (200-300 tokens): Explanations with 2-3 key points - Most voice conversations
- Long responses (400-500 tokens): Detailed troubleshooting steps - Only when necessary
- Avoid 1000+ tokens: Voice users lose attention after 30-45 seconds
Service Tier Priority (OpenAI Only)
OpenAI’s priority tier reduces latency for real-time voice applications (a config sketch follows this list):
- Priority Tier (Toggle): When enabled, sets vendorSpecificOptions.service_tier = "priority" for lower latency and higher throughput
- Default (Priority Off): Standard latency, lower cost
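In configuration terms, the toggle corresponds to something like the snippet below. Only the vendorSpecificOptions.service_tier key comes from the description above; the surrounding object shape is an assumption:

```typescript
// Priority tier on: lower latency and higher throughput, at higher cost (OpenAI models only).
const openAiOptions = {
  vendorSpecificOptions: {
    service_tier: "priority", // omit (or leave the toggle off) for standard latency
  },
};

export default openAiOptions;
```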
Voice and Speech Optimization
Voice quality and natural speech patterns are critical for user experience.

TTS Provider Selection
Each provider has different strengths:
- ElevenLabs
- Cartesia
- Dasha
ElevenLabs
Best For: High-quality, natural-sounding voices
Strengths:
- Most natural prosody and emotion
- Excellent multilingual support
- Voice cloning capabilities
- Fine-grained emotion controls
Recommended Settings (sketched in code after this list):
- Speed: 0.9-1.1x (1.0 is natural)
- Stability: 0.4-0.6 (higher = more consistent, less expressive)
- Similarity Boost: 0.7-0.8
- Style: 0.2-0.4 (higher = more exaggerated)
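As a settings object, those ranges might look like this. The field names mirror commonly used ElevenLabs voice settings, but confirm them against your provider's current API, and treat the values as starting points to tune by ear:

```typescript
// Illustrative starting values - tune per voice and verify with real audio.
const voiceSettings = {
  speed: 1.0,              // 0.9-1.1 sounds natural for most voices
  stability: 0.5,          // higher = more consistent, less expressive
  similarity_boost: 0.75,  // how closely output tracks the reference voice
  style: 0.3,              // higher = more exaggerated delivery
};

export default voiceSettings;
```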
Voice Speed Guidelines
Adjust speed based on content complexity and audience:

| Speed | Use Case | Example |
|---|---|---|
| 0.8x - 0.9x | Complex information, elderly users | Technical support, healthcare |
| 1.0x | Standard (Recommended) | Most conversations |
| 1.1x - 1.2x | Simple information, younger users | Order confirmations, quick updates |
| 1.3x+ | Very simple, repetitive content | Automated announcements |
Speech Recognition Best Practices
While ASR is auto-selected by BlackBox, you can optimize for it.

In Your Prompts (an example prompt fragment follows this list):
- Ask confirmation questions: “Did you say ECHO-1234?”
- Spell out ambiguous information: “That’s E as in Echo, C as in Charlie…”
- Use verbal checksums: “Your confirmation code is 1-2-3-4. That’s one, two, three, four.”
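As a prompt fragment, those instructions might read like the sketch below; the wording is illustrative, not a required format:

```typescript
// Illustrative prompt fragment for ASR robustness - adapt the wording to your agent.
export const asrInstructions = `
When the caller gives an order number, code, or email address:
- Read it back and ask for confirmation, e.g. "Did you say ECHO-1234?"
- Spell ambiguous parts with the phonetic alphabet: "E as in Echo, C as in Charlie".
- Repeat digits one at a time: "Your confirmation code is 1-2-3-4. That's one, two, three, four."
If you are not confident about what the caller said, ask them to repeat it.
`;
```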
Conversation Design
Design conversations that feel natural and accomplish goals efficiently.

Conversation Flow Patterns
Use these proven patterns (the first is sketched as a simple state machine after this list):
- Greeting → Intent → Action → Close
- Verification → Confirmation → Execution
- Discovery → Qualification → Next Steps
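One way to make the first pattern explicit is a small stage type that your conversation handler advances through. This is a generic sketch, not a BlackBox API:

```typescript
// Generic sketch of the Greeting → Intent → Action → Close pattern.
type Stage = "greeting" | "intent" | "action" | "close";

function nextStage(
  stage: Stage,
  done: { intentCaptured: boolean; actionFinished: boolean },
): Stage {
  switch (stage) {
    case "greeting":
      return "intent";                                  // greet once, then find out why they called
    case "intent":
      return done.intentCaptured ? "action" : "intent"; // stay here until the goal is clear
    case "action":
      return done.actionFinished ? "close" : "action";  // look up data, call tools, answer
    case "close":
      return "close";                                   // confirm resolution and end the call
  }
}
```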
Turn-Taking and Interruptions
Design for natural conversation flow and allow natural interruptions: when the caller cuts in, the agent should stop speaking and listen rather than finish its sentence.

Error Recovery
Plan for conversation breakdowns. For misunderstanding recovery, define how the agent rephrases, confirms what it heard, and escalates to a human if the confusion persists.

Testing Strategies
Systematic testing prevents production failures and poor user experiences.

Pre-Launch Testing Checklist
Test these scenarios before deploying:

Happy Path (5-10 tests):
- Simple request with immediate answer
- Multi-step conversation (3+ turns)
- Tool/function call succeeds
- Transfer to human works
- End conversation naturally
Edge Cases:
- Silence for 10+ seconds
- Customer interrupts mid-sentence
- Background noise (music, traffic, talking)
- Customer speaks very fast or slow
- Repeated misrecognition of same word
- Request for something out of scope
- Customer is angry or frustrated
- Multiple requests in one turn
- Customer changes mind mid-conversation
Failure Scenarios:
- Tool call times out
- Tool returns error
- Invalid data format
- Webhook doesn’t respond
- Network interruption
- Customer hangs up mid-turn
Load Testing
For high-volume deployments, test concurrency (a minimal harness is sketched after this list):
- Start small: 5-10 concurrent calls
- Measure: Latency, error rate, call quality
- Increase gradually: Double concurrency each round
- Monitor: Watch for degradation patterns
- Set limits: Configure concurrency caps based on results
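A minimal shape for such a harness is sketched below. placeTestCall is a stub for however you actually start a call (SDK, SIP, or HTTP API), so treat this as a skeleton rather than a working client:

```typescript
// Skeleton load test: replace placeTestCall with your real call-placement logic.
async function placeTestCall(): Promise<void> {
  // e.g. dial a test number via your telephony client and wait for hangup
  await new Promise((resolve) => setTimeout(resolve, 1000));
}

async function runRound(concurrency: number) {
  const started = Date.now();
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, () => placeTestCall()),
  );
  const failures = results.filter((r) => r.status === "rejected").length;
  console.log(
    `concurrency=${concurrency} errors=${failures}/${concurrency} wallClockMs=${Date.now() - started}`,
  );
}

// Start small and double each round, watching latency and error rate for degradation.
for (const concurrency of [5, 10, 20, 40]) {
  await runRound(concurrency);
}
```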
A/B Testing
Compare agent versions systematically.

Version A vs B:
- Same agent, different prompts
- Same prompt, different temperatures
- Same config, different voices
Metrics to Compare (an aggregation sketch follows this list):
- Call success rate (user achieved goal)
- Average call duration
- Tool call accuracy
- User satisfaction (if post-call analysis enabled)
- Transfer rate (lower is often better)
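To compare variants from call logs, aggregate per-variant metrics along these lines; the CallRecord shape is an assumption about what your call logs or post-call analysis export contains:

```typescript
// Assumed call-log shape - adjust to whatever your export actually contains.
interface CallRecord {
  variant: "A" | "B";
  success: boolean;      // did the user achieve their goal?
  durationSec: number;
  transferred: boolean;  // escalated to a human
}

function summarize(calls: CallRecord[], variant: "A" | "B") {
  const subset = calls.filter((c) => c.variant === variant);
  const rate = (pred: (c: CallRecord) => boolean) =>
    subset.length ? subset.filter(pred).length / subset.length : 0;
  return {
    calls: subset.length,
    successRate: rate((c) => c.success),
    transferRate: rate((c) => c.transferred),
    avgDurationSec:
      subset.reduce((sum, c) => sum + c.durationSec, 0) / Math.max(subset.length, 1),
  };
}

// Usage: console.table([summarize(calls, "A"), summarize(calls, "B")]);
```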
Performance Optimization
Optimize for latency, cost, and quality based on your priorities.

Latency Reduction
Voice agents are latency-sensitive. Reduce delays:

Choose Fast Components:
- LLM: gpt-4.1-mini, llama-3.1-8b-instant, grok-3-mini
- TTS: Cartesia (fastest), Dasha (fast), ElevenLabs (slower but high quality)
- ASR: Auto-selection handles this
Streamline the Prompt:
- Shorter prompts = faster processing
- Remove redundant instructions
- Use tools instead of long context
Keep Responses Short:
- Max tokens: 150-300 for voice
- Concise system prompts
- Discourage verbose responses
Cost Optimization
Reduce costs without sacrificing quality:

Model Selection:
- deepseek-r1 - 30x cheaper than GPT-4, similar quality
- llama-3.1-8b-instant - Very low cost per token
- gpt-4.1-nano - OpenAI’s most cost-effective
Prompt Optimization:
- Shorter system prompts (remove examples if not needed)
- Lower max tokens (100-200 for simple agents)
- Use tools for data lookup (don’t put data in prompt)
Caching and Reuse:
- Reuse common prompt segments
- Cache knowledge base embeddings
- Minimize unique per-call prompt variations
Quality Monitoring
Track these metrics in production:

Per-Call Metrics:
- Success rate (did user achieve goal?)
- Call duration (outliers indicate issues)
- Tool call accuracy
- Number of clarification requests
Aggregate Metrics:
- Daily/weekly call volume trends
- Error rate by error type
- User satisfaction scores (via post-call analysis)
- Transfer rate (escalations to human)
Set Alerts For (a threshold-check sketch follows this list):
- Error rate > 5% in 1 hour
- Average call duration > 2x baseline
- Success rate < 70%
- Concurrency limit reached
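If your monitoring stack lets you script checks, the thresholds above translate roughly into the function below; the HourlyMetrics shape is an assumption about what you collect:

```typescript
// Assumed hourly metrics snapshot - wire this to whatever you actually collect.
interface HourlyMetrics {
  errorRate: number;           // 0.05 = 5%
  avgCallDurationSec: number;
  baselineDurationSec: number;
  successRate: number;         // 0.70 = 70%
  concurrentCalls: number;
  concurrencyLimit: number;
}

function alertsFor(m: HourlyMetrics): string[] {
  const alerts: string[] = [];
  if (m.errorRate > 0.05) alerts.push("Error rate above 5% in the last hour");
  if (m.avgCallDurationSec > 2 * m.baselineDurationSec) alerts.push("Average call duration above 2x baseline");
  if (m.successRate < 0.7) alerts.push("Success rate below 70%");
  if (m.concurrentCalls >= m.concurrencyLimit) alerts.push("Concurrency limit reached");
  return alerts;
}
```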
Common Anti-Patterns
Avoid these mistakes that lead to poor user experiences:

Overly Complex Prompts
Anti-Pattern: Cramming every policy, exception, and FAQ answer into one sprawling system prompt
Best Practice: Keep the prompt focused on the agent’s core job and move data lookups into tools
Ignoring Voice-Specific Design
Anti-Pattern: Designing for text chat and expecting it to work for voice
Example: Using bullet points, tables, URLs, “click here” instructions
Best Practice:
- Verbal lists: “I have three options for you. First, second, third.”
- No visual references: “I’ll send you a link” not “Click the button below”
- Spell out important codes: “Your code is A-B-C-1-2-3. That’s Alpha Bravo Charlie one two three.”
Not Testing Edge Cases
Anti-Pattern: Only testing happy path scenarios
Result: Agents that fail when customers deviate from expected behavior
Best Practice:
- Test with real background noise
- Test with fast/slow speakers
- Test interruptions and silence
- Test unclear requests
Over-Engineering on Day 1
Anti-Pattern: Building a perfect agent with every feature before testing
Result: Months of development before user feedback, misaligned features
Best Practice:
- Start with minimum viable agent (basic prompt + 1-2 tools)
- Deploy to limited beta users
- Iterate based on real conversation data
- Add complexity only when needed
Ignoring Metrics
Anti-Pattern: “Set it and forget it” - no monitoring after deployment
Result: Degraded performance goes unnoticed, user satisfaction drops
Best Practice:
- Daily review of key metrics (success rate, errors)
- Weekly review of conversation samples
- Monthly prompt optimization based on patterns
- Set up automated alerts for anomalies
Production Deployment Checklist
Use this checklist before going live:

Pre-Launch:
- Test with 10+ users outside your team
- Review 50+ test conversation transcripts
- Set up monitoring dashboards
- Configure error alerts
- Monitor concurrency limits (contact support if you need a higher cap)
- Test failure scenarios (timeouts, errors)
- Verify webhook endpoints are live
- Test call transfers work
- Confirm business hours are correct
- Set up post-call analysis (optional)
Launch Day:
- Start with small percentage of traffic (10-20%)
- Monitor metrics every hour
- Have human fallback ready
- Quick prompt iteration capability
- Support team briefed on escalation
First Week:
- Daily metric reviews
- Sample conversation reviews
- Collect user feedback
- Adjust prompt based on findings
- Gradually increase traffic
Ongoing:
- Weekly performance reviews
- Monthly prompt optimization
- Quarterly voice/model updates
- Regular A/B testing
Next Steps
- Configuration: LLM Configuration, Voice & Speech
- Testing: Testing Overview, Dashboard Testing
- Advanced: Advanced Features, Post-Call Analysis
- Deployment: Production Checklist