Transcribe audio to text (STT) and generate speech from text (TTS). Enable with "audio_*" in allowedTools.
Maximum audio file size: 25 MB (MAX_AUDIO_SIZE). Default timeout: 120 seconds. Zero vendor SDK dependencies — all provider APIs are called via direct fetch.
Providers:
  • Speech-to-text — OpenAI Whisper (default), Deepgram Nova
  • Text-to-speech — OpenAI (default), Deepgram Aura, ElevenLabs, Edge (free, no API key)
Credential resolution (per provider, checked in order):
  1. Agent vault — e.g. service "openai" key "key", service "deepgram" key "key", service "elevenlabs" key "key"
  2. Environment variables — OPENAI_API_KEY, DEEPGRAM_API_KEY, or ELEVENLABS_API_KEY
  3. Automatic fallback (text-to-speech only): if the cloud provider fails (quota, auth, billing), edge-tts is tried automatically. Edge TTS uses Microsoft Edge's neural voices through the locally installed edge-tts package. It is free, needs no API key, and offers ~400 voices in 60+ languages. Install: pip install edge-tts
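
The lookup order above can be sketched as a small function. This is a minimal illustration, not the tool's actual implementation: resolve_api_key and the nested-dict vault shape are hypothetical stand-ins for the agent vault, and only the environment variable names come from the list above.

```python
import os
from typing import Mapping, Optional

# Environment variable names as listed in step 2.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "deepgram": "DEEPGRAM_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def resolve_api_key(
    provider: str,
    vault: Mapping[str, Mapping[str, str]],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first credential found: vault entry, then environment.

    `vault` is a hypothetical mapping shaped as {service: {key_name: value}}.
    None means no credential was found for the provider.
    """
    if provider == "edge":
        return None  # Edge TTS needs no API key
    from_vault = vault.get(provider, {}).get("key")
    if from_vault:
        return from_vault
    env_name = ENV_VARS.get(provider)
    if env_name is None:
        return None
    return env.get(env_name) or None
```

Passing env explicitly, as in resolve_api_key("deepgram", {}, {"DEEPGRAM_API_KEY": "..."}), keeps the lookup easy to exercise without touching the real process environment.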

audio_transcribe — Speech to Text

Transcribe an audio file to text using a speech-to-text provider.

Parameters

Parameter  Type                     Required  Description
path       string                   yes       Path to the audio file
provider   enum: openai | deepgram  no        STT provider (default openai)
model      string                   no        Model name. OpenAI: whisper-1 (default). Deepgram: nova-3 (default)
language   string                   no        Language hint as ISO 639-1 code (e.g. en, es, fr)
prompt     string                   no        Prompt to guide transcription style (OpenAI only)

Returns

Transcribed text string.

Notes

  • Supported audio formats: mp3, wav, flac, ogg, m4a, webm
  • prompt is only supported by the OpenAI provider — it is ignored for Deepgram
  • language helps improve accuracy for non-English audio
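
As a rough sketch of the constraints above, the following pre-flight check mirrors the supported formats, the 25 MB limit, and the rule that prompt is dropped for Deepgram. The function names and the returned argument dictionary are hypothetical, and 25 MB is assumed to mean 25 * 1024 * 1024 bytes.

```python
import os
from typing import Dict, List, Optional

MAX_AUDIO_SIZE = 25 * 1024 * 1024  # 25 MB, assuming binary megabytes
SUPPORTED_FORMATS = {"mp3", "wav", "flac", "ogg", "m4a", "webm"}


def audio_input_problems(path: str, size_bytes: int) -> List[str]:
    """Return a list of human-readable problems; empty means the input looks fine."""
    problems = []
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        problems.append(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_AUDIO_SIZE:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_AUDIO_SIZE}")
    return problems


def transcribe_args(
    path: str,
    provider: str = "openai",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> Dict[str, str]:
    """Build an argument dict, omitting prompt for providers that ignore it."""
    args = {"path": path, "provider": provider}
    if language:
        args["language"] = language
    if prompt and provider == "openai":
        args["prompt"] = prompt
    return args
```

In practice the size would come from os.path.getsize(path); it is a parameter here so the check can be run without a real file.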

audio_speak — Text to Speech

Generate an audio file from text using a text-to-speech provider.

Parameters

Parameter     Type                                         Required  Description
text          string                                       yes       Text to convert to speech
path          string                                       yes       Destination file path for the audio output
provider      enum: openai | deepgram | elevenlabs | edge  no        TTS provider (default openai). edge is free (no API key). Cloud providers fall back to edge automatically on failure
language      string                                       no        ISO 639-1 language code (e.g. it, en, es). Used by edge to select the right voice
gender        enum: male | female                          no        Voice gender preference. Used by edge to pick the right voice
model         string                                       no        Model name (provider-specific defaults apply)
voice         string                                       no        Voice selection (see below for options)
speed         number                                       no        Playback speed, 0.25 to 4.0 (OpenAI only)
instructions  string                                       no        Style instructions for the voice (gpt-4o-mini-tts model only)
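
To illustrate how the provider-specific parameters interact, here is a minimal sketch that builds an audio_speak argument set. The speak_args helper is hypothetical, and it reads the speed row as the range 0.25 to 4.0. It drops speed for non-OpenAI providers and drops instructions unless the model is gpt-4o-mini-tts, matching the table.

```python
from typing import Any, Dict, Optional

SPEED_MIN, SPEED_MAX = 0.25, 4.0  # assumed reading of the speed range
INSTRUCTIONS_MODEL = "gpt-4o-mini-tts"


def speak_args(
    text: str,
    path: str,
    provider: str = "openai",
    model: Optional[str] = None,
    speed: Optional[float] = None,
    instructions: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an argument dict, keeping only options the provider/model honors."""
    args: Dict[str, Any] = {"text": text, "path": path, "provider": provider}
    if model is not None:
        args["model"] = model
    # speed is honored only by OpenAI; other providers ignore it.
    if speed is not None and provider == "openai":
        if not SPEED_MIN <= speed <= SPEED_MAX:
            raise ValueError(f"speed must be between {SPEED_MIN} and {SPEED_MAX}")
        args["speed"] = speed
    # instructions applies only when the gpt-4o-mini-tts model is selected.
    if instructions and model == INSTRUCTIONS_MODEL:
        args["instructions"] = instructions
    return args
```

Filtering on the caller's side is a design choice for clarity only; per the table, the tool itself ignores unsupported options rather than rejecting them.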

Returns

Confirmation with the output file path, duration, and file size.

Notes

  • OpenAI voices: alloy, echo, fable, onyx, nova, shimmer (default alloy)
  • Deepgram: model-based voice selection (default aura-2-en)
  • ElevenLabs voices: voice ID string (default Rachel)
  • Edge voices: auto-selected from language + gender (e.g. it + male → it-IT-DiegoNeural), or pass an explicit voice such as it-IT-ElsaNeural
  • OpenAI output formats: mp3, wav, flac, opus, aac, pcm
  • Deepgram output format: mp3
  • ElevenLabs output formats: mp3, wav, flac
  • Edge output format: mp3
  • Output format is determined by the file extension of path
  • speed is only supported by OpenAI — it is ignored for other providers
  • instructions is only supported with the gpt-4o-mini-tts model
  • If the chosen cloud provider fails (quota exceeded, auth error, billing issue), edge-tts is tried automatically as fallback
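
The format and fallback rules in these notes can be sketched as follows. This is an illustration under stated assumptions: output_format, speak_with_fallback, and ProviderError are hypothetical names, and the two callables stand in for whatever actually performs the synthesis. Only the per-provider format lists come from the notes.

```python
import os
from typing import Callable

# Output formats per provider, as listed in the notes.
OUTPUT_FORMATS = {
    "openai": {"mp3", "wav", "flac", "opus", "aac", "pcm"},
    "deepgram": {"mp3"},
    "elevenlabs": {"mp3", "wav", "flac"},
    "edge": {"mp3"},
}


class ProviderError(Exception):
    """Hypothetical error for quota, auth, or billing failures."""


def output_format(provider: str, path: str) -> str:
    """Derive the output format from the file extension and validate it."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in OUTPUT_FORMATS.get(provider, set()):
        raise ValueError(f"{provider} cannot write '{ext}' output")
    return ext


def speak_with_fallback(
    primary: Callable[[], str], fallback: Callable[[], str]
) -> str:
    """Try the cloud provider first; on a provider failure, try the fallback."""
    try:
        return primary()
    except ProviderError:
        return fallback()
```

For example, output_format("openai", "out.opus") returns "opus", while output_format("deepgram", "out.wav") raises ValueError because Deepgram is listed with mp3 only.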