Audio API

Drop-in replacement for the OpenAI Audio API. Route text-to-speech (TTS) and speech-to-text (STT) requests across OpenAI and SiliconFlow with automatic fallback, unified billing, and consistent response shapes.

Endpoints

Endpoint	Mode	Description
POST /v1/audio/speech	sync	Text-to-speech. Returns audio bytes inline. Drop-in for OpenAI /v1/audio/speech.
POST /v1/audio/transcriptions	sync	Speech-to-text. Returns JSON transcript inline. Drop-in for OpenAI /v1/audio/transcriptions.
POST /v1/jobs (model=audio)	async	Async audio job. Returns 202 with job ID. Poll GET /v1/jobs/:id. Use for long inputs or background processing.

Text-to-Speech models

All models accept the OpenAI voice parameter (alloy, echo, fable, onyx, nova, shimmer). SiliconFlow models map unknown voice names to their default voice.

Name	Type	Description
openai/tts-1	tts	OpenAI TTS-1. Optimised for speed. Billed per character ($15/1M chars). Falls back to SiliconFlow CosyVoice2 at priority 1.
openai/tts-1-hd	tts	OpenAI TTS-1 HD. Optimised for quality. Billed per character ($30/1M chars).
openai/gpt-4o-mini-tts	tts	GPT-4o Mini TTS. Instruction-following voice model. Billed per token + request.
alibaba/cosyvoice2-0.5b	tts	SiliconFlow CosyVoice2 0.5B. CJK-optimised TTS. Billed per UTF-8 byte ($7.15/1M bytes). Best choice for Chinese text.
fishaudio/fish-speech-1.5	tts	Fish Audio Fish-Speech 1.5. Highly expressive multilingual TTS. $7.15/1M UTF-8 bytes.
indexteam/indextts-2	tts	IndexTTS-2. Cloning-grade voice fidelity. $7.15/1M UTF-8 bytes.

Speech-to-Text models

Name	Type	Description
openai/whisper-1	stt	OpenAI Whisper-1. General-purpose multilingual transcription. $0.006/min. Falls back to SiliconFlow SenseVoice at priority 1.
alibaba/sensevoice-small	stt	SiliconFlow SenseVoice Small. Fast, emotion-aware transcription optimised for Chinese and English. $0.006/min.

Sync vs async

The sync endpoints (/v1/audio/speech and /v1/audio/transcriptions) stream the result back inline, which suits interactive use and short inputs. For long audio files or batch workloads, use POST /v1/jobs with an audio model: the gateway queues the request, stores the output in S3, and returns a presigned URL when the job is complete.
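The polling side of the async flow can be sketched as below. This is a minimal illustration, not the gateway's client: `fetch_job` is a hypothetical callable standing in for GET /v1/jobs/:id, and the `status` field with its `completed` and `failed` values is an assumption, since this section does not specify the job response shape.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job until it reaches a terminal state, then return it.

    fetch_job stands in for GET /v1/jobs/:id and must return the parsed
    JSON body. The "status" key and its values are assumed names.
    """
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        # Back off before the next poll; sleep is injectable for testing.
        sleep(interval_s)
        waited += interval_s
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library and makes it easy to exercise with a fake.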

Both sync endpoints are drop-in compatible with the OpenAI SDK. The only change required is pointing the base URL at https://api.therouter.ai.
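As a rough sketch of what a sync text-to-speech call might look like without the SDK, the example below builds the request with Python's standard library. The body fields (`model`, `input`, `voice`) follow the OpenAI speech endpoint. Bearer-token authentication and the MP3 output format are assumptions carried over from OpenAI's defaults, not something this section confirms.

```python
import json
import urllib.request

BASE_URL = "https://api.therouter.ai"  # base URL given in this section


def build_speech_request(
    api_key: str,
    text: str,
    model: str = "openai/tts-1",
    voice: str = "alloy",
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    # Field names mirror the OpenAI speech endpoint.
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, text: str, out_path: str = "speech.mp3") -> None:
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(api_key, text)
    with urllib.request.urlopen(req) as resp:
        # The endpoint returns audio bytes inline; MP3 is assumed here.
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Switching to a SiliconFlow model would only change the `model` argument, for example `build_speech_request(key, text, model="alibaba/cosyvoice2-0.5b")`.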

Billing

Provider	Billing unit	Rate
OpenAI TTS	per character	tts-1: $15/1M chars. tts-1-hd: $30/1M chars.
SiliconFlow TTS	per UTF-8 byte	CosyVoice2, Fish-Speech, IndexTTS-2: $7.15/1M UTF-8 bytes. Most CJK characters occupy 3 bytes in UTF-8, so CJK text costs about 3× as much per character as ASCII text.
OpenAI STT	per minute	whisper-1: $0.006/min.
SiliconFlow STT	per minute	SenseVoice-Small: $0.006/min.
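The difference between per-character and per-byte billing is easiest to see with a small estimator. The sketch below uses the rates from the table above. The function and constant names are illustrative, and counting OpenAI characters with Python's `len()` (Unicode code points) is an assumption about how characters are metered.

```python
# Rates in USD per 1M units, taken from the billing table above.
OPENAI_USD_PER_M_CHARS = {
    "openai/tts-1": 15.00,
    "openai/tts-1-hd": 30.00,
}
SILICONFLOW_USD_PER_M_BYTES = 7.15
SILICONFLOW_TTS_MODELS = {
    "alibaba/cosyvoice2-0.5b",
    "fishaudio/fish-speech-1.5",
    "indexteam/indextts-2",
}


def estimate_tts_cost(model: str, text: str) -> float:
    """Estimate the TTS cost in USD for a single request."""
    if model in OPENAI_USD_PER_M_CHARS:
        # Per-character billing; len() counts code points (assumed metering).
        return len(text) * OPENAI_USD_PER_M_CHARS[model] / 1_000_000
    if model in SILICONFLOW_TTS_MODELS:
        # Per-byte billing on the UTF-8 encoding of the input.
        n_bytes = len(text.encode("utf-8"))
        return n_bytes * SILICONFLOW_USD_PER_M_BYTES / 1_000_000
    raise ValueError(f"no per-character or per-byte rate known for {model}")
```

For example, "你好" is 2 characters but 6 UTF-8 bytes, so a SiliconFlow model bills it as 6 units, while "hello" is 5 characters and 5 bytes. Models billed per token or per minute are deliberately left out of this sketch.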

Generations

Text-to-Speech