LLM Configuration
The Large Language Model (LLM) is the brain of your AI agent. It processes conversations, generates responses, and makes decisions based on your system prompt and configuration. BlackBox supports multiple LLM vendors, each offering different models with varying capabilities, speeds, and cost structures.
Supported LLM Vendors
BlackBox integrates with five LLM vendors:
| Vendor | Best For | Context Window | Key Advantage |
|---|---|---|---|
| OpenAI | Production use, reliability | 128k tokens | Industry-leading quality and reasoning |
| Groq | Ultra-fast inference | 32k tokens | Fastest response times available |
| Grok (xAI) | Advanced reasoning | 128k tokens | Latest AI innovations from xAI |
| DeepSeek | Cost-efficiency | 128k tokens | 30x more cost-efficient than GPT-4 |
| Custom Compatible | Self-hosted models | Varies | Full control and customization |
OpenAI
OpenAI provides the most widely-used and battle-tested LLMs for conversational AI.
Available Models
GPT-4.1 Series (Latest)
gpt-4.1
- Latest flagship model with enhanced reasoning
- Context Window: 128k tokens
- Best For: Complex conversations requiring deep understanding
- Use Cases: Customer support, sales qualification, medical assistance
- Cost Tier: Premium
gpt-4.1-mini
- Faster, more cost-effective version of GPT-4.1
- Context Window: 128k tokens
- Best For: High-volume applications with balanced quality/cost
- Use Cases: Lead qualification, appointment scheduling, FAQs
- Cost Tier: Mid
gpt-4.1-nano
- Ultra-fast with minimal latency
- Context Window: 128k tokens
- Best For: Real-time voice conversations requiring instant responses
- Use Cases: Quick Q&A, simple routing, basic information gathering
- Cost Tier: Low
GPT-4o Series (Multimodal)
gpt-4o
- Multimodal model supporting vision and audio
- Context Window: 128k tokens
- Best For: Applications requiring image understanding
- Use Cases: Visual product support, document analysis
- Cost Tier: Premium
gpt-4o-mini
- Affordable multimodal capabilities
- Context Window: 128k tokens
- Best For: Cost-conscious multimodal applications
- Cost Tier: Mid
Reasoning Models
o3-mini
- Specialized reasoning model
- Context Window: 128k tokens
- Best For: Complex problem-solving and logical reasoning
- Use Cases: Technical troubleshooting, decision trees
- Cost Tier: Premium
OpenAI-Specific Options
Service Tier
OpenAI offers a priority service tier for lower latency at potentially higher cost. This is controlled via a checkbox in the UI labeled “Use priority tier (lower latency)”.
When enabled, the priority tier is set in vendorSpecificOptions:
const agent = {
config: {
llmConfig: {
vendor: "openai",
model: "gpt-4.1",
vendorSpecificOptions: {
service_tier: "priority"
}
}
}
};
Behavior:
- Enabled (checkbox checked): Sets vendorSpecificOptions.service_tier to "priority"
  - Lower latency and higher request priority
  - May increase API costs
  - Recommended for latency-sensitive production agents
- Disabled (checkbox unchecked): The service_tier field is omitted entirely
  - Standard OpenAI behavior (default tier)
  - Best for cost-conscious applications
The priority tier is OpenAI-specific and only available when using OpenAI as your LLM vendor. If you switch vendors, this setting is automatically cleared.
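If you build this configuration programmatically, one way to mirror the checkbox behavior is to include the field only when the priority tier is wanted. A minimal sketch, assuming a usePriorityTier flag of your own (it is not part of the BlackBox API):
// usePriorityTier is an illustrative flag, e.g. read from your own settings store
const usePriorityTier = true;

const llmConfig = {
  vendor: "openai",
  model: "gpt-4.1",
  // The conditional spread adds service_tier only when the flag is on;
  // otherwise vendorSpecificOptions is omitted entirely, matching the unchecked state
  ...(usePriorityTier
    ? { vendorSpecificOptions: { service_tier: "priority" } }
    : {})
};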
Configuration Example
const openaiAgent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Customer Support Agent",
config: {
primaryLanguage: "en-US",
llmConfig: {
vendor: "openai",
model: "gpt-4.1-mini",
prompt: "You are a helpful customer support agent for Acme Corp. Be professional, empathetic, and concise.",
options: {
temperature: 0.7,
maxTokens: 1000,
topP: 0.9
},
vendorSpecificOptions: {
service_tier: "priority" // Optional: enable for lower latency
}
},
ttsConfig: { /* ... */ }
}
})
});
Groq
Groq delivers the fastest LLM inference speeds, ideal for real-time voice applications.
Available Models
llama-3.3-70b-versatile
- Production-ready model with broad capabilities
- Context Window: 32k tokens
- Best For: General-purpose voice agents
- Speed: Extremely fast inference
- Use Cases: Any voice application requiring instant responses
llama-3.1-8b-instant
- Ultra-fast inference for simple tasks
- Context Window: 32k tokens
- Best For: High-volume, straightforward conversations
- Speed: Fastest available
- Use Cases: Quick routing, basic Q&A, simple interactions
deepseek-r1-distill-llama-70b
- Reasoning-enhanced model
- Context Window: 32k tokens
- Best For: Decision-making and logical reasoning
- Use Cases: Technical support, troubleshooting
gemma2-9b-it
- Instruction-following specialist
- Context Window: 32k tokens
- Best For: Structured conversations with clear workflows
- Use Cases: Appointment booking, form filling
qwen-2.5-coder-32b
- Code-specialized model
- Context Window: 32k tokens
- Best For: Technical conversations and code discussion
- Use Cases: Developer support, API assistance
Configuration Example
const groqAgent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Fast Response Agent",
config: {
primaryLanguage: "en-US",
llmConfig: {
vendor: "groq",
model: "llama-3.3-70b-versatile",
prompt: "You are a quick and efficient assistant. Provide direct, concise answers.",
options: {
temperature: 0.6,
maxTokens: 800,
topP: 0.95
}
},
ttsConfig: { /* ... */ }
}
})
});
Groq excels at low-latency responses. Pair it with fast TTS providers like Cartesia or Dasha for the quickest possible conversations.
Grok (xAI)
Grok models from xAI provide cutting-edge reasoning and conversational capabilities.
Available Models
grok-2
- Latest flagship model with enhanced reasoning
- Context Window: 128k tokens
- Best For: Complex reasoning and nuanced conversations
- Use Cases: Advisory roles, complex customer issues
grok-2-mini
- Faster, cost-effective version
- Context Window: 128k tokens
- Best For: Balanced performance and cost
- Use Cases: General-purpose voice agents
grok-3-mini
- Latest mini model with improved reasoning
- Context Window: 128k tokens
- Best For: Production agents requiring strong reasoning at lower cost
- Use Cases: Sales, support, complex routing
Configuration Example
const grokAgent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Advisory Agent",
config: {
primaryLanguage: "en-US",
llmConfig: {
vendor: "grok",
model: "grok-2-mini",
prompt: "You are a knowledgeable advisor. Provide thoughtful, well-reasoned guidance.",
options: {
temperature: 0.8,
maxTokens: 1200
}
},
ttsConfig: { /* ... */ }
}
})
});
DeepSeek
DeepSeek offers breakthrough cost-efficiency while maintaining GPT-4 level quality.
Available Models
deepseek-r1
- Breakthrough reasoning model
- Context Window: 128k tokens
- Cost Efficiency: 30x more cost-efficient than GPT-4
- Best For: Budget-conscious production deployments
- Use Cases: Any application requiring GPT-4 quality at lower cost
- Notable Feature: Advanced reasoning capabilities
deepseek-v3
- GPT-4 equivalent performance
- Context Window: 128k tokens
- Best For: High-quality conversations at reduced cost
- Use Cases: Customer support, sales, complex interactions
Configuration Example
const deepseekAgent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Cost-Efficient Agent",
config: {
primaryLanguage: "en-US",
llmConfig: {
vendor: "deepseek",
model: "deepseek-r1",
prompt: "You are an intelligent assistant focused on solving problems efficiently.",
options: {
temperature: 0.7,
maxTokens: 1000,
topP: 0.9
}
},
ttsConfig: { /* ... */ }
}
})
});
DeepSeek’s 30x cost advantage makes it well suited to high-volume applications. Test it against OpenAI for your use case; you may find comparable quality at significantly lower cost.
Custom Compatible Provider
Use any OpenAI-compatible API endpoint, including self-hosted models or alternative providers.
When to Use Custom Providers
- Self-hosted models for data privacy
- Alternative providers with OpenAI-compatible APIs
- Custom fine-tuned models
- On-premise deployments
Required Configuration
Custom providers require additional configuration:
const customAgent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Self-Hosted Agent",
config: {
primaryLanguage: "en-US",
llmConfig: {
vendor: "customCompatible",
model: "your-model-name",
endpoint: "https://api.yourprovider.com/v1",
apiKey: "your-provider-api-key",
prompt: "You are a custom AI assistant.",
options: {
temperature: 0.7,
maxTokens: 1000
}
},
ttsConfig: { /* ... */ }
}
})
});
Custom Provider Fields
Endpoint URL (Required)
- Full URL to OpenAI-compatible API
- Must support the /chat/completions endpoint
- Format: https://api.example.com/v1
- Validation: Must be a valid HTTPS URL
API Key (Required)
- Authentication key for your custom provider
- Minimum 10 characters
- Stored securely, never exposed in responses
Model ID (Required)
- Model identifier as expected by your provider
- Can be any string recognized by your endpoint
- Example: llama-2-70b, custom-gpt-4, fine-tuned-model-v2
Custom providers must implement the OpenAI Chat Completions API format. Incompatible APIs will cause agent failures. Test thoroughly before production use.
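Before wiring a custom endpoint into an agent, it can help to confirm that it actually answers the Chat Completions route in the expected shape. A minimal smoke test, assuming placeholder values for the endpoint, key, and model name:
// Smoke test for an OpenAI-compatible endpoint (all values below are placeholders)
const endpoint = "https://api.yourprovider.com/v1";
const apiKey = "your-provider-api-key";

const res = await fetch(`${endpoint}/chat/completions`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: "your-model-name",
    messages: [{ role: "user", content: "Reply with one short sentence." }],
    max_tokens: 20
  })
});

// A compatible endpoint returns HTTP 200 with choices[0].message.content populated
const data = await res.json();
console.log(res.status, data.choices?.[0]?.message?.content);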
LLM Parameters
All LLM vendors support standard configuration parameters that control response behavior.
Temperature
Controls randomness and creativity in responses.
Range: 0.0 to 2.0
Default: Varies by vendor (typically 0.7-1.0)
Recommended: 0.6-0.8 for voice agents
How Temperature Works:
- Low (0.0-0.5): Focused, deterministic, consistent
  - Use for: FAQs, factual information, structured workflows
  - Example: “What are your business hours?” → Same answer every time
- Medium (0.6-0.9): Balanced creativity and consistency
  - Use for: General conversation, customer support
  - Example: Friendly greetings with natural variation
- High (1.0-2.0): Creative, varied, unpredictable
  - Use for: Storytelling, brainstorming (rarely for voice agents)
  - Warning: May produce hallucinations or inconsistent information
// Conservative agent for factual responses
llmConfig: {
options: {
temperature: 0.3 // Very focused and consistent
}
}
// Conversational agent with natural variation
llmConfig: {
options: {
temperature: 0.7 // Balanced and natural
}
}
// Creative agent (use cautiously)
llmConfig: {
options: {
temperature: 1.2 // More creative but less predictable
}
}
For production voice agents, we recommend a temperature between 0.6 and 0.8. Lower values can feel robotic; higher values risk inconsistency.
Max Tokens
Limits the maximum length of the LLM’s response.
Type: Positive integer
Default: Varies by model (often 2048-4096)
Recommended: 500-1000 for voice agents
Why Limit Tokens:
- Cost Control: Reduce token usage and API costs
- Conciseness: Force agent to be brief (important for voice)
- Performance: Faster response generation
- User Experience: Avoid long-winded voice responses
Token Estimation:
- ~4 characters per token (English)
- ~1 word = 1.3 tokens (average)
- 100 tokens ≈ 75 words ≈ 300 characters
// Brief responses for quick interactions
llmConfig: {
options: {
maxTokens: 150 // ~100 words, 20-30 seconds of speech
}
}
// Standard conversation
llmConfig: {
options: {
maxTokens: 500 // ~375 words, 60-90 seconds of speech
}
}
// Detailed explanations
llmConfig: {
options: {
maxTokens: 1000 // ~750 words, 2-3 minutes of speech
}
}
Voice conversations over 60 seconds per turn feel unnatural. Keep maxTokens around 500-700 for best user experience.
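To turn a target response length into a maxTokens budget, a rough sizing helper based on the estimates above (~1 word ≈ 1.3 tokens) can be handy. The function name and rounding are illustrative, not part of the BlackBox API:
// Convert a target word count into an approximate maxTokens value
function maxTokensForWords(targetWords) {
  return Math.ceil(targetWords * 1.3);
}

maxTokensForWords(100); // ≈ 130 tokens, in line with the "brief responses" example
maxTokensForWords(375); // ≈ 488 tokens, close to the "standard conversation" example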
Top P (Nucleus Sampling)
Alternative to temperature for controlling randomness via probability mass.
Range: 0.0 to 1.0
Default: 1.0 (disabled)
Recommended: 0.9-0.95 when used
How Top P Works:
Top P limits the model to the most probable tokens whose cumulative probability reaches P.
- 0.9: Only consider tokens making up the top 90% of probability mass
  - More focused, reduces unlikely words
  - Good for consistent, reliable responses
- 1.0: Consider all tokens
  - Full distribution, maximum flexibility
  - Standard behavior
// Focused responses using Top P
llmConfig: {
options: {
temperature: 1.0, // Keep standard
topP: 0.9 // Limit to top 90% probability
}
}
Temperature vs Top P:
OpenAI recommends using either temperature or topP, not both. If you set both, temperature takes precedence in most implementations.
| Approach | Temperature | Top P | Use When |
|---|---|---|---|
| Temperature Control | 0.6-0.8 | 1.0 (default) | Standard voice agents |
| Top P Control | 1.0 (default) | 0.9-0.95 | Need precise probability control |
| Conservative | 0.5 | 1.0 | Factual, consistent responses |
| Balanced | 0.7 | 1.0 | Natural conversations |
Vendor Comparison
| Vendor | Average Latency | Context Window | Cost (Relative) |
|---|---|---|---|
| Groq | 50-100ms | 32k | Low |
| OpenAI (nano) | 200-400ms | 128k | Low |
| OpenAI (mini) | 300-600ms | 128k | Medium |
| DeepSeek | 400-800ms | 128k | Very Low |
| Grok (mini) | 400-700ms | 128k | Medium |
| OpenAI (4.1) | 600-1200ms | 128k | High |
For voice agents, latency matters more than for chat. Aim for total response time (LLM + TTS) under 1.5 seconds for natural conversations.
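During development, one simple way to track that budget is to time the round trip yourself. A sketch, where makeTestRequest stands in for however you exercise your agent (it is not a BlackBox API call):
// Generic timing wrapper; makeTestRequest is a placeholder async function you supply
async function measureLatency(makeTestRequest) {
  const start = performance.now();
  await makeTestRequest();
  const elapsedMs = performance.now() - start;
  console.log(`Round trip: ${elapsedMs.toFixed(0)} ms (target: under 1500 ms including TTS)`);
  return elapsedMs;
}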
Quality Comparison
| Vendor | Reasoning | Creativity | Instruction Following | Best Use Case |
|---|---|---|---|---|
| OpenAI GPT-4.1 | Excellent | Excellent | Excellent | Complex support |
| Grok-2 | Excellent | Very Good | Excellent | Advisory roles |
| DeepSeek-R1 | Very Good | Good | Very Good | Cost-conscious production |
| Groq Llama-3.3 | Good | Good | Very Good | Speed-critical apps |
| OpenAI o3-mini | Excellent | Good | Very Good | Reasoning tasks |
Cost Efficiency
| Vendor & Model | Cost Tier | Best Value For |
|---|---|---|
| DeepSeek-R1 | Lowest | High-volume production |
| Groq (any) | Low | Speed + cost balance |
| OpenAI nano | Low-Mid | Simple interactions |
| OpenAI mini | Mid | Balanced quality/cost |
| Grok mini | Mid | Advanced reasoning at lower cost |
| OpenAI 4.1 | High | Premium quality required |
Choosing the Right LLM
Decision Framework
Start with these questions:
- What’s your priority?
  - Speed → Groq
  - Quality → OpenAI GPT-4.1
  - Cost → DeepSeek
  - Balance → OpenAI mini or Grok mini
- How complex are conversations?
  - Simple Q&A → Groq llama-3.1-8b-instant
  - General support → OpenAI mini or DeepSeek-R1
  - Complex reasoning → OpenAI 4.1 or Grok-2
- What’s your call volume?
  - High volume → DeepSeek (cost efficiency)
  - Medium volume → OpenAI mini
  - Low volume → OpenAI 4.1 (premium quality)
- Do you need special features?
  - Vision/multimodal → OpenAI GPT-4o
  - Code discussion → Groq qwen-2.5-coder
  - Reasoning → DeepSeek-R1 or OpenAI o3-mini
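If you prefer to encode the first question in code, a small helper can make the default choice explicit. This is a sketch of the mapping above; the function name and returned values are illustrative defaults, not an official API:
// Map the priority answer to a starting vendor/model from the framework above
function defaultLlmConfigFor(priority) {
  switch (priority) {
    case "speed":   return { vendor: "groq",     model: "llama-3.3-70b-versatile" };
    case "quality": return { vendor: "openai",   model: "gpt-4.1" };
    case "cost":    return { vendor: "deepseek", model: "deepseek-r1" };
    case "balance": return { vendor: "openai",   model: "gpt-4.1-mini" };
    default:        return { vendor: "openai",   model: "gpt-4.1-mini" };
  }
}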
Common Configurations
Customer Support Agent
llmConfig: {
vendor: "openai",
model: "gpt-4.1-mini",
options: {
temperature: 0.7,
maxTokens: 600
},
vendorSpecificOptions: {
service_tier: "priority" // Optional: for lower latency
}
}
High-Speed Lead Qualifier
llmConfig: {
vendor: "groq",
model: "llama-3.3-70b-versatile",
options: {
temperature: 0.6,
maxTokens: 400
}
}
Cost-Optimized Production Agent
llmConfig: {
vendor: "deepseek",
model: "deepseek-r1",
options: {
temperature: 0.7,
maxTokens: 800
}
}
Complex Advisory Agent
llmConfig: {
vendor: "grok",
model: "grok-2",
options: {
temperature: 0.8,
maxTokens: 1000
}
}
Testing and Optimization
A/B Testing LLMs
Compare different vendors for your specific use case:
- Create identical agents with different LLM configs
- Run parallel test calls with the same scenarios
- Measure:
  - Response quality (user satisfaction)
  - Response speed (average latency)
  - Response length (token usage)
  - Conversation success rate
- Compare costs over 100-1000 calls
// Agent A: OpenAI
const agentA = { llmConfig: { vendor: "openai", model: "gpt-4.1-mini" } };
// Agent B: DeepSeek
const agentB = { llmConfig: { vendor: "deepseek", model: "deepseek-r1" } };
// Agent C: Groq
const agentC = { llmConfig: { vendor: "groq", model: "llama-3.3-70b-versatile" } };
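To collect those metrics across the candidate agents, a small harness can run the same scenarios against each and aggregate the results. A sketch, where runScenario is a placeholder for however you drive a single test call and report its outcome:
// Illustrative harness: runScenario(agent, scenario) is assumed to resolve to
// { latencyMs, tokensUsed, success } for one test conversation
async function compareAgents(agents, scenarios, runScenario) {
  const results = {};
  for (const [name, agent] of Object.entries(agents)) {
    const runs = [];
    for (const scenario of scenarios) {
      runs.push(await runScenario(agent, scenario));
    }
    results[name] = {
      avgLatencyMs: runs.reduce((sum, r) => sum + r.latencyMs, 0) / runs.length,
      avgTokens: runs.reduce((sum, r) => sum + r.tokensUsed, 0) / runs.length,
      successRate: runs.filter(r => r.success).length / runs.length
    };
  }
  return results;
}

// Usage: await compareAgents({ agentA, agentB, agentC }, testScenarios, runScenario)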
Parameter Tuning
Temperature Tuning:
- Start at 0.7 (balanced)
- Test with real conversation scenarios
- Adjust based on observations:
- Too robotic/repetitive → Increase to 0.8-0.9
- Too creative/inconsistent → Decrease to 0.5-0.6
- Hallucinating information → Decrease to 0.3-0.5
MaxTokens Tuning:
- Monitor average response length in production
- If responses frequently truncated → Increase maxTokens
- If responses too long → Decrease maxTokens or improve prompt
- Optimal: 90% of responses complete, none over 60 seconds spoken
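A quick way to spot truncation while tuning is to check how many responses stop because they hit the token cap. The sketch below assumes you log OpenAI-style completion metadata where finish_reason === "length" marks a cut-off response; adapt the field name to whatever your logs actually contain:
// responses is an array of logged completions with an OpenAI-style finish_reason field
function truncationRate(responses) {
  const truncated = responses.filter(r => r.finish_reason === "length").length;
  return truncated / responses.length;
}

// If more than ~10% of responses are truncated, raise maxTokens
// or tighten the prompt so answers finish sooner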
Next Steps
Now that you’ve configured your LLM, continue building your agent:
API Cross-References