Voice Features (TTS/STT)
Complete guide to building voice-enabled AI agents with Text-to-Speech (TTS) and Speech-to-Text (STT).
Overview
AgentSea ADK provides the following voice capabilities:
- Speech-to-Text (STT) - Transcribe audio to text
- Text-to-Speech (TTS) - Synthesize speech from text
- Voice Agent - Wrapper that combines both for voice conversations
- Multiple Providers - Cloud and local options
- Streaming - Real-time audio streaming
- Multiple Languages - Support for 99+ languages
7 Voice Providers Included
STT Providers: OpenAI Whisper (cloud), LemonFox (cost-effective), Local Whisper (privacy)
TTS Providers: OpenAI TTS, LemonFox (cost-effective), ElevenLabs (premium), Piper TTS (local)
Speech-to-Text (STT)
OpenAI Whisper (Cloud)
High-quality transcription with OpenAI's Whisper model - supports 99+ languages with word-level timestamps.
import { OpenAIWhisperProvider } from '@lov3kaizen/agentsea-core';
const sttProvider = new OpenAIWhisperProvider(process.env.OPENAI_API_KEY);
// Transcribe audio file
const result = await sttProvider.transcribe('./audio.mp3', {
  model: 'whisper-1',
  language: 'en',
  responseFormat: 'verbose_json',
});
console.log('Text:', result.text);
console.log('Language:', result.language);
console.log('Duration:', result.duration);
LemonFox STT (Cost-Effective)
OpenAI-compatible Whisper v3 transcription at a fraction of the cost - $0.50 per 3 hours of audio. Perfect for production workloads.
import { LemonFoxSTTProvider } from '@lov3kaizen/agentsea-core';
const sttProvider = new LemonFoxSTTProvider(process.env.LEMONFOX_API_KEY);
// Transcribe audio file - same API as OpenAI Whisper
const result = await sttProvider.transcribe('./audio.mp3', {
  model: 'whisper-1',
  language: 'en',
  responseFormat: 'verbose_json',
});
console.log('Text:', result.text);
console.log('Duration:', result.duration);
Alternative: Custom baseURL
You can also use the OpenAI Whisper provider with a custom baseURL:
new OpenAIWhisperProvider({ apiKey: process.env.LEMONFOX_API_KEY, baseURL: 'https://api.lemonfox.ai/v1' });
Local Whisper (Privacy)
Run Whisper locally for complete privacy - your audio never leaves your machine.
import { LocalWhisperProvider } from '@lov3kaizen/agentsea-core';
const sttProvider = new LocalWhisperProvider({
  whisperPath: '/usr/local/bin/whisper',
  modelPath: '/path/to/ggml-base.bin',
});
// Check if installed; print setup instructions and exit if not
if (!(await sttProvider.isInstalled())) {
  console.log(sttProvider.getInstallInstructions());
  process.exit(1); // a bare `return` is not valid at module top level
}
const result = await sttProvider.transcribe('./audio.wav', {
  model: 'base',
  language: 'en',
});
Text-to-Speech (TTS)
OpenAI TTS
High-quality voices with streaming support - 6 voices available.
import { OpenAITTSProvider } from '@lov3kaizen/agentsea-core';
import { writeFileSync } from 'fs';
const ttsProvider = new OpenAITTSProvider(process.env.OPENAI_API_KEY);
// Synthesize speech
const result = await ttsProvider.synthesize('Hello, world!', {
  model: 'tts-1-hd',
  voice: 'nova',
  speed: 1.0,
  format: 'mp3',
});
// Save audio
writeFileSync('./output.mp3', result.audio);
LemonFox TTS (Cost-Effective)
OpenAI-compatible TTS with 50+ voices at up to 90% savings - $2.50 per 1M characters. Includes all OpenAI voices plus extras like "sarah".
import { LemonFoxTTSProvider } from '@lov3kaizen/agentsea-core';
import { writeFileSync } from 'fs';
const ttsProvider = new LemonFoxTTSProvider(process.env.LEMONFOX_API_KEY);
// Synthesize speech - same API as OpenAI TTS
const result = await ttsProvider.synthesize('Hello, world!', {
  model: 'tts-1',
  voice: 'sarah', // or any OpenAI voice like 'nova'
  format: 'mp3',
});
// Save audio
writeFileSync('./output.mp3', result.audio);
// Streaming also supported
for await (const chunk of ttsProvider.synthesizeStream('Long text...')) {
  // Process audio chunks in real-time
}
Alternative: Custom baseURL
You can also use the OpenAI TTS provider with a custom baseURL:
new OpenAITTSProvider({ apiKey: process.env.LEMONFOX_API_KEY, baseURL: 'https://api.lemonfox.ai/v1' });
ElevenLabs (Premium)
Studio-quality synthesis with voice cloning and 100+ premium voices.
import { ElevenLabsTTSProvider } from '@lov3kaizen/agentsea-core';
const ttsProvider = new ElevenLabsTTSProvider(process.env.ELEVENLABS_API_KEY);
// List available voices
const voices = await ttsProvider.getVoices();
console.log('Available voices:', voices.length);
// Use a specific voice
const result = await ttsProvider.synthesize('Hello, world!', {
  voice: 'EXAVITQu4vr4xnSDxMaL', // Sarah voice ID
  model: 'eleven_multilingual_v2',
  stability: 0.5,
  similarityBoost: 0.75,
});
Piper TTS (Local)
Fast neural synthesis running locally - complete privacy with no API costs.
import { PiperTTSProvider } from '@lov3kaizen/agentsea-core';
const ttsProvider = new PiperTTSProvider({
  piperPath: '/usr/local/bin/piper',
  modelPath: '/path/to/en_US-lessac-medium.onnx',
});
// Check installation; print setup instructions and exit if Piper is missing
if (!(await ttsProvider.isInstalled())) {
  console.log(ttsProvider.getInstallInstructions());
  process.exit(1); // a bare `return` is not valid at module top level
}
const result = await ttsProvider.synthesize('Hello, world!', {
  voice: 'lessac',
  speakerId: 0,
});
Voice Agent
The VoiceAgent class wraps a regular Agent with STT and TTS providers for complete voice interactions. It handles the full pipeline: audio input → transcription → agent processing → speech synthesis → audio output.
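To make the pipeline concrete, here is a minimal sketch of a single voice turn written against stand-in interfaces. The names `SpeechToText`, `TextToSpeech`, `TextAgent`, and `voiceTurn` are hypothetical and are not the ADK's own types or methods; the stubs return canned values so the flow runs without API keys.

```typescript
// Hypothetical stand-in interfaces; NOT the AgentSea ADK's actual types.
interface SpeechToText {
  transcribe(audio: Uint8Array): Promise<{ text: string }>;
}
interface TextToSpeech {
  synthesize(text: string): Promise<{ audio: Uint8Array }>;
}
interface TextAgent {
  respond(input: string): Promise<string>;
}

interface VoiceTurnResult {
  transcript: string;
  reply: string;
  audio?: Uint8Array; // only set when autoSpeak is true
}

// One turn: audio input -> transcription -> agent processing -> speech synthesis.
async function voiceTurn(
  audio: Uint8Array,
  stt: SpeechToText,
  agent: TextAgent,
  tts: TextToSpeech,
  autoSpeak = true,
): Promise<VoiceTurnResult> {
  const { text: transcript } = await stt.transcribe(audio);
  const reply = await agent.respond(transcript);
  if (!autoSpeak) return { transcript, reply };
  const { audio: speech } = await tts.synthesize(reply);
  return { transcript, reply, audio: speech };
}

// Stubs with canned behavior so the sketch runs offline.
const stubStt: SpeechToText = {
  transcribe: async () => ({ text: 'what time is it' }),
};
const stubAgent: TextAgent = {
  respond: async (input) => `You asked: ${input}`,
};
const stubTts: TextToSpeech = {
  // Pretend each character becomes one byte of audio.
  synthesize: async (text) => ({ audio: new Uint8Array(text.length) }),
};

voiceTurn(new Uint8Array(16), stubStt, stubAgent, stubTts).then((turn) => {
  console.log(turn.transcript, '->', turn.reply, `(${turn.audio?.length ?? 0} audio bytes)`);
});
```

Passing `false` for `autoSpeak` in this sketch skips synthesis and returns only the transcript and reply. The ADK example that follows uses the real classes.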
import {
  Agent,
  AnthropicProvider,
  VoiceAgent,
  OpenAIWhisperProvider,
  OpenAITTSProvider,
  ToolRegistry,
  BufferMemory,
} from '@lov3kaizen/agentsea-core';
import type { VoiceAgentConfig, TTSConfig } from '@lov3kaizen/agentsea-types';
// Create base agent
const agent = new Agent(
  {
    name: 'voice-assistant',
    model: 'claude-sonnet-4-20250514',
    provider: 'anthropic',
    systemPrompt: 'You are a helpful voice assistant. Keep responses concise.',
  },
  new AnthropicProvider(),
  new ToolRegistry(),
  new BufferMemory(50)
);
// Wrap with voice capabilities
const voiceAgent = new VoiceAgent(agent, {
  sttProvider: new OpenAIWhisperProvider(process.env.OPENAI_API_KEY),
  ttsProvider: new OpenAITTSProvider(process.env.OPENAI_API_KEY),
  ttsConfig: { voice: 'nova', model: 'tts-1' },
  autoSpeak: true, // Automatically synthesize responses
});
Best Practices
Cost-Effective Production
Use LemonFox for production: STT at $0.50/3hrs (lowest on market) and TTS at $2.50/1M chars (up to 90% savings). Same API as OpenAI - just swap the provider!
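As a rough check on these figures, the arithmetic can be sketched as follows. The rates are the ones quoted in this guide, and `estimateVoiceCostUSD` is a hypothetical helper, not part of the ADK; confirm current pricing before budgeting.

```typescript
// Rates as quoted in this guide (assumed current; verify with the provider).
const STT_USD_PER_HOUR = 0.5 / 3;         // $0.50 per 3 hours of audio
const TTS_USD_PER_CHAR = 2.5 / 1_000_000; // $2.50 per 1M characters

// Hypothetical helper: estimated spend for a given transcription and synthesis volume.
function estimateVoiceCostUSD(audioHours: number, ttsCharacters: number): number {
  if (audioHours < 0 || ttsCharacters < 0) {
    throw new RangeError('Usage figures must be non-negative');
  }
  return audioHours * STT_USD_PER_HOUR + ttsCharacters * TTS_USD_PER_CHAR;
}

// Example: 300 hours of transcription plus 4M synthesized characters.
const estimate = estimateVoiceCostUSD(300, 4_000_000);
console.log(`Estimated cost: $${estimate.toFixed(2)}`); // Estimated cost: $60.00
```

In this example the transcription accounts for $50 and the synthesis for $10.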
Voice-Optimized Prompts
Keep responses concise for voice. Add "Keep responses under 2 sentences" to your system prompt.
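One way to apply this is to append the brevity instruction when building the system prompt and, as a fallback, trim the model's reply before synthesis. Both helpers below are hypothetical sketches, not ADK functions, and the sentence splitter is deliberately naive.

```typescript
// Hypothetical helper: append a brevity instruction to a base system prompt.
function voiceSystemPrompt(base: string, maxSentences = 2): string {
  return `${base.trim()} Keep responses to at most ${maxSentences} sentences.`;
}

// Hypothetical fallback: keep only the first N sentences of a reply.
// Splits on '.', '!' and '?', so abbreviations and decimals will be miscounted.
function limitSentences(text: string, maxSentences: number): string {
  const sentences = text.trim().match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return sentences
    .slice(0, maxSentences)
    .map((sentence) => sentence.trim())
    .join(' ');
}

console.log(voiceSystemPrompt('You are a helpful voice assistant.'));
console.log(limitSentences('It is sunny. Expect a high of 21. Bring sunglasses!', 2));
// prints "It is sunny. Expect a high of 21."
```

Trimming after the fact is only a safety net; the prompt instruction does most of the work because it lets the model plan a short answer instead of having a long one cut off.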
Audio Format
Use MP3 for storage (smaller files) and WAV for processing (uncompressed, so no lossy-compression artifacts). Most providers accept both.
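The size difference is easy to estimate. The sketch below assumes uncompressed PCM WAV with a 44-byte header and constant-bitrate MP3, ignoring tags and frame overhead; the helper names and default parameters are illustrative only.

```typescript
// Approximate size of an uncompressed PCM WAV file (assumes a 44-byte header).
function wavBytes(
  seconds: number,
  sampleRate = 44_100,
  channels = 2,
  bitsPerSample = 16,
): number {
  return 44 + seconds * sampleRate * channels * (bitsPerSample / 8);
}

// Approximate size of a constant-bitrate MP3 (ignores ID3 tags and frame headers).
function mp3Bytes(seconds: number, kbps = 128): number {
  return (seconds * kbps * 1000) / 8;
}

const oneMinute = 60;
console.log(`WAV: ${wavBytes(oneMinute)} bytes`); // WAV: 10584044 bytes
console.log(`MP3: ${mp3Bytes(oneMinute)} bytes`); // MP3: 960000 bytes
```

Under these assumptions a minute of CD-quality stereo WAV is roughly 11 times larger than a 128 kbps MP3, which is why MP3 suits archival and WAV suits intermediate processing.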
Privacy First
Use local providers (Local Whisper + Piper TTS + Ollama) for sensitive data. No data leaves your machine.
Next Steps
- Learn about Local Models - Run everything locally
- Try the CLI Tool - Interactive voice setup
- Explore All Providers - 12+ LLM providers
- View Examples - Complete voice examples