Voice Features (TTS/STT)

Complete guide to building voice-enabled AI agents with Text-to-Speech (TTS) and Speech-to-Text (STT).

Overview

AgentSea ADK ships with the following voice components and capabilities:

Speech-to-Text (STT) - Transcribe audio to text
Text-to-Speech (TTS) - Synthesize speech from text
Voice Agent - Wrapper that combines both for voice conversations
Multiple Providers - Cloud and local options
Streaming - Real-time audio streaming
Multiple Languages - Support for 99+ languages

7 Voice Providers Included

STT Providers: OpenAI Whisper (cloud), LemonFox (cost-effective), Local Whisper (privacy)

TTS Providers: OpenAI TTS, LemonFox (cost-effective), ElevenLabs (premium), Piper TTS (local)

Speech-to-Text (STT)

OpenAI Whisper (Cloud)

High-quality transcription with OpenAI's Whisper model - supports 99+ languages with word-level timestamps.

typescript

import { OpenAIWhisperProvider } from '@lov3kaizen/agentsea-core';

const sttProvider = new OpenAIWhisperProvider(process.env.OPENAI_API_KEY);

// Transcribe audio file
const result = await sttProvider.transcribe('./audio.mp3', {
  model: 'whisper-1',
  language: 'en',
  responseFormat: 'verbose_json',
});

console.log('Text:', result.text);
console.log('Language:', result.language);
console.log('Duration:', result.duration);

LemonFox STT (Cost-Effective)

OpenAI-compatible Whisper v3 transcription at a fraction of the cost - $0.50 per 3 hours of audio. Perfect for production workloads.

typescript

import { LemonFoxSTTProvider } from '@lov3kaizen/agentsea-core';

const sttProvider = new LemonFoxSTTProvider(process.env.LEMONFOX_API_KEY);

// Transcribe audio file - same API as OpenAI Whisper
const result = await sttProvider.transcribe('./audio.mp3', {
  model: 'whisper-1',
  language: 'en',
  responseFormat: 'verbose_json',
});

console.log('Text:', result.text);
console.log('Duration:', result.duration);

Alternative: Custom baseURL

You can also use the OpenAI Whisper provider with a custom baseURL:

new OpenAIWhisperProvider({ apiKey: process.env.LEMONFOX_API_KEY, baseURL: 'https://api.lemonfox.ai/v1' })
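
Because the LemonFox and OpenAI Whisper providers accept the same transcribe call, one possible use of that compatibility is a fallback chain. The sketch below is illustrative only: the Transcriber and TranscriptionResult interfaces and the transcribeWithFallback helper are hypothetical names declared locally, not part of AgentSea, and they assume the providers structurally match the transcribe call shown in this guide.

```typescript
// Hypothetical minimal shape shared by the STT providers shown above.
// Declared locally for illustration; NOT imported from AgentSea.
interface TranscriptionResult {
  text: string;
  language?: string;
  duration?: number;
}

interface Transcriber {
  transcribe(
    audioPath: string,
    options?: Record<string, unknown>,
  ): Promise<TranscriptionResult>;
}

// Try each provider in order and return the first successful result.
async function transcribeWithFallback(
  providers: Transcriber[],
  audioPath: string,
  options?: Record<string, unknown>,
): Promise<TranscriptionResult> {
  const errors: unknown[] = [];
  for (const provider of providers) {
    try {
      return await provider.transcribe(audioPath, options);
    } catch (err) {
      errors.push(err); // remember the failure and move on to the next provider
    }
  }
  throw new Error(
    `All ${providers.length} STT providers failed (${errors.length} errors)`,
  );
}
```

Under these assumptions, passing a cloud provider first and a local provider second would keep transcription working during a cloud outage.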

Local Whisper (Privacy)

Run Whisper locally for complete privacy - your audio never leaves your machine.

typescript

import { LocalWhisperProvider } from '@lov3kaizen/agentsea-core';

const sttProvider = new LocalWhisperProvider({
  whisperPath: '/usr/local/bin/whisper',
  modelPath: '/path/to/ggml-base.bin',
});

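// Check if installed; print setup instructions and exit if not.
// (A bare `return` is not valid at the top level of a module.)
if (!(await sttProvider.isInstalled())) {
  console.log(sttProvider.getInstallInstructions());
  process.exit(1);
}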

const result = await sttProvider.transcribe('./audio.wav', {
  model: 'base',
  language: 'en',
});

Text-to-Speech (TTS)

OpenAI TTS

High-quality voices with streaming support - 6 voices available.

typescript

import { OpenAITTSProvider } from '@lov3kaizen/agentsea-core';
import { writeFileSync } from 'fs';

const ttsProvider = new OpenAITTSProvider(process.env.OPENAI_API_KEY);

// Synthesize speech
const result = await ttsProvider.synthesize('Hello, world!', {
  model: 'tts-1-hd',
  voice: 'nova',
  speed: 1.0,
  format: 'mp3',
});

// Save audio
writeFileSync('./output.mp3', result.audio);

LemonFox TTS (Cost-Effective)

OpenAI-compatible TTS with 50+ voices at up to 90% savings - $2.50 per 1M characters. Includes all OpenAI voices plus extras like "sarah".

typescript

import { LemonFoxTTSProvider } from '@lov3kaizen/agentsea-core';
import { writeFileSync } from 'fs';

const ttsProvider = new LemonFoxTTSProvider(process.env.LEMONFOX_API_KEY);

// Synthesize speech - same API as OpenAI TTS
const result = await ttsProvider.synthesize('Hello, world!', {
  model: 'tts-1',
  voice: 'sarah', // or any OpenAI voice like 'nova'
  format: 'mp3',
});

// Save audio
writeFileSync('./output.mp3', result.audio);

// Streaming also supported
for await (const chunk of ttsProvider.synthesizeStream('Long text...')) {
  // Process audio chunks in real-time
}
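
If the streamed audio needs to be saved or passed on as a single buffer, the chunks can be concatenated as they arrive. This is a minimal sketch that assumes each chunk is a Uint8Array (Node's Buffer is a Uint8Array subclass); the collectAudio helper and the fakeStream stand-in are hypothetical names, and the actual chunk type yielded by synthesizeStream is not confirmed here.

```typescript
// Concatenate audio chunks from any async iterable into one Uint8Array.
// Assumption: the stream yields Uint8Array (or Buffer) chunks.
async function collectAudio(
  stream: AsyncIterable<Uint8Array>,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    chunks.push(chunk);
    total += chunk.length;
  }
  // Copy every chunk into a single contiguous buffer.
  const audio = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    audio.set(chunk, offset);
    offset += chunk.length;
  }
  return audio;
}

// Stand-in stream for demonstration only; a provider stream would go here.
async function* fakeStream(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array([1, 2]);
  yield new Uint8Array([3]);
}
```

For playback with low latency, forwarding each chunk as it arrives is usually preferable to collecting everything first.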

Alternative: Custom baseURL

You can also use the OpenAI TTS provider with a custom baseURL:

new OpenAITTSProvider({ apiKey: process.env.LEMONFOX_API_KEY, baseURL: 'https://api.lemonfox.ai/v1' })

ElevenLabs (Premium)

Studio-quality synthesis with voice cloning and 100+ premium voices.

typescript

import { ElevenLabsTTSProvider } from '@lov3kaizen/agentsea-core';

const ttsProvider = new ElevenLabsTTSProvider(process.env.ELEVENLABS_API_KEY);

// List available voices
const voices = await ttsProvider.getVoices();
console.log('Available voices:', voices.length);

// Use a specific voice
const result = await ttsProvider.synthesize('Hello, world!', {
  voice: 'EXAVITQu4vr4xnSDxMaL', // Sarah voice ID
  model: 'eleven_multilingual_v2',
  stability: 0.5,
  similarityBoost: 0.75,
});

Piper TTS (Local)

Fast neural synthesis running locally - complete privacy with no API costs.

typescript

import { PiperTTSProvider } from '@lov3kaizen/agentsea-core';

const ttsProvider = new PiperTTSProvider({
  piperPath: '/usr/local/bin/piper',
  modelPath: '/path/to/en_US-lessac-medium.onnx',
});

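// Check installation; print setup instructions and exit if missing.
// (A bare `return` is not valid at the top level of a module.)
if (!(await ttsProvider.isInstalled())) {
  console.log(ttsProvider.getInstallInstructions());
  process.exit(1);
}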

const result = await ttsProvider.synthesize('Hello, world!', {
  voice: 'lessac',
  speakerId: 0,
});

Voice Agent

The VoiceAgent class wraps a regular Agent with STT and TTS providers for complete voice interactions. It handles the full pipeline: audio input → transcription → agent processing → speech synthesis → audio output.

typescript

import {
  Agent,
  AnthropicProvider,
  VoiceAgent,
  OpenAIWhisperProvider,
  OpenAITTSProvider,
  ToolRegistry,
  BufferMemory,
} from '@lov3kaizen/agentsea-core';
import type { VoiceAgentConfig, TTSConfig } from '@lov3kaizen/agentsea-types';

// Create base agent
const agent = new Agent(
  {
    name: 'voice-assistant',
    model: 'claude-sonnet-4-20250514',
    provider: 'anthropic',
    systemPrompt: 'You are a helpful voice assistant. Keep responses concise.',
  },
  new AnthropicProvider(),
  new ToolRegistry(),
  new BufferMemory(50)
);

// Wrap with voice capabilities
const voiceAgent = new VoiceAgent(agent, {
  sttProvider: new OpenAIWhisperProvider(process.env.OPENAI_API_KEY),
  ttsProvider: new OpenAITTSProvider(process.env.OPENAI_API_KEY),
  ttsConfig: { voice: 'nova', model: 'tts-1' },
  autoSpeak: true, // Automatically synthesize responses
});
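
To make the pipeline concrete, here is a minimal sketch of a single voice turn. It is not VoiceAgent's actual implementation: the SpeechToText, TextToSpeech, and TextAgent interfaces, the reply method, and the runVoiceTurn function are all hypothetical names declared locally, loosely modeled on the calls shown in this guide.

```typescript
// Hypothetical local interfaces; they are NOT the AgentSea type definitions.
interface SpeechToText {
  transcribe(audioPath: string): Promise<{ text: string }>;
}
interface TextToSpeech {
  synthesize(text: string): Promise<{ audio: Uint8Array }>;
}
interface TextAgent {
  reply(userText: string): Promise<string>; // hypothetical method name
}

interface VoiceTurn {
  heard: string; // what the STT step transcribed
  answer: string; // what the agent replied
  audio?: Uint8Array; // synthesized speech, present only when autoSpeak is on
}

// One voice turn: audio in -> transcript -> agent reply -> optional speech out.
async function runVoiceTurn(
  audioPath: string,
  stt: SpeechToText,
  agent: TextAgent,
  tts: TextToSpeech,
  autoSpeak = true,
): Promise<VoiceTurn> {
  const { text: heard } = await stt.transcribe(audioPath);
  const answer = await agent.reply(heard);
  if (!autoSpeak) return { heard, answer };
  const { audio } = await tts.synthesize(answer);
  return { heard, answer, audio };
}
```

Keeping the three stages behind small interfaces like these is also what makes it straightforward to swap a cloud provider for a local one.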

Best Practices

Cost-Effective Production

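For cost-sensitive production workloads, consider LemonFox: STT costs $0.50 per 3 hours of audio (about $0.17 per hour) and TTS costs $2.50 per 1M characters (up to 90% savings). The API is OpenAI-compatible, so switching only requires swapping the provider class.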
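
As a rough budgeting aid, the quoted rates can be turned into a monthly estimate. This is a sketch using only the prices stated in this guide; the function and constant names are illustrative, and current pricing should be checked before relying on the numbers.

```typescript
// Rates quoted in this guide; verify current pricing before budgeting.
const STT_USD_PER_HOUR = 0.5 / 3; // $0.50 per 3 hours of audio
const TTS_USD_PER_CHAR = 2.5 / 1e6; // $2.50 per 1M characters

// Estimate combined STT + TTS spend, rounded to the nearest cent.
function estimateMonthlyCostUsd(audioHours: number, ttsChars: number): number {
  const cost = audioHours * STT_USD_PER_HOUR + ttsChars * TTS_USD_PER_CHAR;
  return Math.round(cost * 100) / 100;
}

// 300 hours of transcription plus 4M synthesized characters:
// 300 * (0.50 / 3) = $50 for STT, 4 * $2.50 = $10 for TTS.
console.log(estimateMonthlyCostUsd(300, 4_000_000)); // 60
```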

Voice-Optimized Prompts

Keep responses concise for voice, since listeners cannot skim spoken output. Add an instruction such as "Keep responses under 2 sentences" to your system prompt.
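
A prompt instruction is not a guarantee, so some applications also trim the reply before synthesis. The helper below is a hypothetical safety net, not an AgentSea feature; its sentence detection is deliberately naive (it splits on ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr. Smith" will be split incorrectly).

```typescript
// Keep only the first `max` sentences of a reply before sending it to TTS.
// Naive rule: a sentence ends at ., ! or ? followed by whitespace.
function limitSentences(text: string, max = 2): string {
  const sentences = text.trim().split(/(?<=[.!?])\s+/);
  return sentences.slice(0, max).join(' ');
}

console.log(limitSentences('One. Two! Three?')); // "One. Two!"
```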

Audio Format

Use MP3 for storage (smaller files) and WAV for processing (uncompressed, so no lossy-compression artifacts). Most providers accept both.
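
When audio arrives from an unknown source, it can help to check which of the two formats it actually is before handing it to a provider. The sketch below inspects the leading bytes; sniffAudioFormat is a hypothetical helper, and the MP3 check is a heuristic (the frame-sync pattern can also appear at the start of other MPEG audio streams).

```typescript
type AudioFormat = 'wav' | 'mp3' | 'unknown';

// Guess the container/codec from the first bytes of the file.
function sniffAudioFormat(bytes: Uint8Array): AudioFormat {
  // Read bytes[start..end) as an ASCII string.
  const ascii = (start: number, end: number): string => {
    let s = '';
    for (let i = start; i < end; i++) s += String.fromCharCode(bytes[i]);
    return s;
  };
  // WAV: a RIFF container whose form type at offset 8 is "WAVE".
  if (bytes.length >= 12 && ascii(0, 4) === 'RIFF' && ascii(8, 12) === 'WAVE') {
    return 'wav';
  }
  // MP3 with an ID3v2 tag starts with the ASCII characters "ID3".
  if (bytes.length >= 3 && ascii(0, 3) === 'ID3') return 'mp3';
  // MP3 without a tag starts with an MPEG frame sync: 11 set bits.
  if (bytes.length >= 2 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0) {
    return 'mp3';
  }
  return 'unknown';
}
```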

Privacy First

Use local providers (Local Whisper for STT, Piper TTS for speech, and Ollama for the LLM) for sensitive data, so that no data leaves your machine.

Next Steps

Learn about Local Models - Run everything locally
Try the CLI Tool - Interactive voice setup
Explore All Providers - 12+ LLM providers
View Examples - Complete voice examples