Audio API
Drop-in replacement for the OpenAI Audio API. Route TTS and STT requests across OpenAI and SiliconFlow with automatic fallback, unified billing, and consistent response shapes.
Endpoints
| Endpoint | Mode | Description |
|---|---|---|
| POST /v1/audio/speech | sync | Text-to-speech. Returns audio bytes inline. Drop-in for OpenAI /v1/audio/speech. |
| POST /v1/audio/transcriptions | sync | Speech-to-text. Returns JSON transcript inline. Drop-in for OpenAI /v1/audio/transcriptions. |
| POST /v1/jobs (model=audio) | async | Async audio job. Returns 202 with job ID. Poll GET /v1/jobs/:id. Use for long inputs or background processing. |
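As an illustration of the sync text-to-speech endpoint, here is a minimal sketch that builds a POST /v1/audio/speech request with the Python standard library. It assumes the gateway accepts the same JSON body as the OpenAI Audio API (`model`, `input`, `voice`) and OpenAI-style Bearer authentication; the function names and the `speech.mp3` output path are hypothetical, and the actual audio format returned may differ.

```python
import json
import urllib.request

BASE_URL = "https://api.therouter.ai"  # base URL named in this document


def build_speech_request(text, api_key, model="openai/tts-1", voice="alloy"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    # Field names follow the OpenAI Audio API request body (assumed compatible).
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=body,
        method="POST",
        headers={
            # Assumption: OpenAI-style Bearer authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def synthesize(text, api_key, out_path="speech.mp3"):
    """Send the request and write the returned audio bytes to a file."""
    req = build_speech_request(text, api_key)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())  # the sync endpoint returns audio bytes inline
    return out_path
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before any network call is made.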
Text-to-Speech models
All models accept the OpenAI voice parameter (alloy, echo, fable, onyx, nova, shimmer). SiliconFlow models map unknown voice names to their default voice.
| Model | Type | Description |
|---|---|---|
| openai/tts-1 | tts | OpenAI TTS-1. Optimised for speed. Billed per character ($15/1M chars). Falls back to SiliconFlow CosyVoice2 at priority 1. |
| openai/tts-1-hd | tts | OpenAI TTS-1 HD. Optimised for quality. Billed per character ($30/1M chars). |
| openai/gpt-4o-mini-tts | tts | GPT-4o Mini TTS. Instruction-following voice model. Billed per token + request. |
| alibaba/cosyvoice2-0.5b | tts | SiliconFlow CosyVoice2 0.5B. CJK-optimised TTS. Billed per UTF-8 byte ($7.15/1M bytes). Best choice for Chinese text. |
| fishaudio/fish-speech-1.5 | tts | Fish Audio Fish-Speech 1.5. Highly expressive multilingual TTS. $7.15/1M UTF-8 bytes. |
| indexteam/indextts-2 | tts | IndexTTS-2. Cloning-grade voice fidelity. $7.15/1M UTF-8 bytes. |
Speech-to-Text models
| Model | Type | Description |
|---|---|---|
| openai/whisper-1 | stt | OpenAI Whisper-1. General-purpose multilingual transcription. $0.006/min. Falls back to SiliconFlow SenseVoice at priority 1. |
| alibaba/sensevoice-small | stt | SiliconFlow SenseVoice Small. Fast, emotion-aware transcription optimised for Chinese and English. $0.006/min. |
Sync vs async
The sync endpoints (/v1/audio/speech and /v1/audio/transcriptions) stream the result back inline; they are best for interactive or short inputs. For long audio files or batch workloads, use POST /v1/jobs with an audio model: the gateway queues the request, stores the output in S3, and returns a presigned URL when complete.
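The async flow can be sketched as a simple polling loop. This is a sketch under stated assumptions, not the gateway's documented schema: the `status` field and its `completed` and `failed` values are hypothetical, as is the helper name `wait_for_job`. The caller supplies `fetch_job`, a function that performs GET /v1/jobs/:id and returns the decoded JSON body.

```python
import time


def wait_for_job(fetch_job, job_id, interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return the job dict.

    fetch_job(job_id) is caller-supplied: it performs GET /v1/jobs/:id
    and returns the decoded JSON response as a dict.
    """
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        status = job.get("status")   # hypothetical field name
        if status == "completed":    # hypothetical terminal value
            return job
        if status == "failed":       # hypothetical terminal value
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if attempt < max_attempts - 1:
            sleep(interval_s)        # wait before the next poll
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} polls")
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any HTTP client and lets it be exercised without network access or real delays.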
Both sync endpoints are drop-in compatible with the OpenAI SDK. No code changes are required: just swap the base URL to https://api.therouter.ai.
Billing
| Provider | Billing unit | Rates |
|---|---|---|
| OpenAI TTS | per character | tts-1: $15/1M chars. tts-1-hd: $30/1M chars. |
| SiliconFlow TTS | per UTF-8 byte | CosyVoice2, Fish-Speech, IndexTTS-2: $7.15/1M UTF-8 bytes. A CJK character typically encodes to 3 UTF-8 bytes, so CJK text bills at about 3× the ASCII rate per character. |
| OpenAI STT | per minute | whisper-1: $0.006/min. |
| SiliconFlow STT | per minute | SenseVoice-Small: $0.006/min. |
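The difference between per-character and per-UTF-8-byte billing can be made concrete with a small estimator. This is a rough sketch using the TTS prices listed above; the function names are hypothetical, and it assumes a "character" means one Unicode code point, which may not match how the gateway actually counts.

```python
# Prices copied from the billing table, in USD per one million units.
PRICE_PER_MILLION = {
    "openai/tts-1": ("chars", 15.00),
    "openai/tts-1-hd": ("chars", 30.00),
    "alibaba/cosyvoice2-0.5b": ("bytes", 7.15),
    "fishaudio/fish-speech-1.5": ("bytes", 7.15),
    "indexteam/indextts-2": ("bytes", 7.15),
}


def billable_units(text, unit):
    """Count characters (code points, an assumption) or UTF-8 bytes."""
    return len(text) if unit == "chars" else len(text.encode("utf-8"))


def estimate_tts_cost(text, model):
    """Estimate the USD cost of synthesizing `text` with `model`."""
    unit, price = PRICE_PER_MILLION[model]
    return billable_units(text, unit) * price / 1_000_000
```

For example, the two-character string "你好" counts as 2 units under per-character billing but 6 units under per-byte billing, since each of those characters takes 3 bytes in UTF-8.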