Text-to-Speech
Convert text to natural-sounding speech. Drop-in replacement for OpenAI's POST /v1/audio/speech endpoint, with automatic fallback across OpenAI and SiliconFlow TTS models.
POST /v1/audio/speech
Request body
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Required | Standard model alias. Must declare speech capability. See Supported models below. |
| input | string | Required | The text to convert. Maximum 4,096 characters for OpenAI models; SiliconFlow models accept up to 1,000 bytes. |
| voice | string | Required | Voice to use: alloy, echo, fable, onyx, nova, shimmer. SiliconFlow models map unknown voices to their default. |
| response_format | string | Optional | Output format: mp3 (default), opus, aac, flac, wav, pcm. |
| speed | number | Optional | Speaking speed multiplier, 0.25 to 4.0. Default 1.0. Supported by OpenAI models only. |
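Because the input limit is counted in characters for OpenAI models and in UTF-8 bytes for SiliconFlow models, a client-side check before sending can avoid a rejected request. The following is a minimal sketch, not part of any SDK: `check_input_length` is a hypothetical helper, and it assumes that aliases starting with `openai/` are OpenAI models while every other alias is a SiliconFlow model.

```python
OPENAI_MAX_CHARS = 4096       # limit from the table, counted in characters
SILICONFLOW_MAX_BYTES = 1000  # limit from the table, counted in UTF-8 bytes


def check_input_length(model: str, text: str) -> bool:
    """Return True if `text` fits the documented input limit for `model`.

    Assumption: aliases prefixed with "openai/" are OpenAI models; any
    other alias is treated as a SiliconFlow model.
    """
    if model.startswith("openai/"):
        # len() counts Unicode code points, used here as "characters".
        return len(text) <= OPENAI_MAX_CHARS
    return len(text.encode("utf-8")) <= SILICONFLOW_MAX_BYTES


if __name__ == "__main__":
    sample = "你好" * 200  # 400 characters, 1,200 UTF-8 bytes
    print(check_input_length("openai/tts-1", sample))
    print(check_input_length("alibaba/cosyvoice2-0.5b", sample))
```

The same 400-character string passes the character-based check but exceeds the byte-based one, which is the case most likely to surprise callers sending CJK text.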
Response
Returns audio bytes with the appropriate Content-Type header (e.g. audio/mpeg for mp3). Status 200 on success.
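For callers not using an SDK, the request and response described above can be sketched with the Python standard library. This is an illustration under assumptions: the host is the one used in the Quick start, Bearer authentication is assumed to match the OpenAI convention, and `build_speech_request` and `save_speech` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "https://api.therouter.ai"  # host taken from the Quick start example


def build_speech_request(api_key: str, model: str, voice: str, text: str,
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    body = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: OpenAI-style Bearer authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str) -> str:
    """Send the request, write the audio bytes to `path`, return the Content-Type."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        content_type = response.headers.get("Content-Type", "")
        with open(path, "wb") as f:
            f.write(response.read())
    return content_type
```

Calling `save_speech(build_speech_request("YOUR_API_KEY", "openai/tts-1", "alloy", "Hello"), "output.mp3")` would write the returned bytes to disk and hand back the Content-Type header for inspection.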
Quick start
Python
from openai import OpenAI

# The SDK appends /audio/speech to base_url, so include the /v1 prefix.
client = OpenAI(
    base_url="https://api.therouter.ai/v1",
    api_key="YOUR_API_KEY",
)

# Stream the audio to disk instead of buffering it all in memory.
with client.audio.speech.with_streaming_response.create(
    model="openai/tts-1",
    voice="alloy",
    input="Hello from TheRouter.ai!",
) as response:
    response.stream_to_file("output.mp3")
CJK / Chinese text
For Chinese and other CJK text, use alibaba/cosyvoice2-0.5b (SiliconFlow). It is billed per UTF-8 byte rather than per character, making CJK billing proportional to actual data size. The fallback from openai/tts-1 to CosyVoice2 is automatic when the OpenAI provider is unavailable.
Python
# SiliconFlow CosyVoice2, optimised for Chinese text.
# Billed per UTF-8 byte (typically 3 bytes per CJK character).
# Reuses the `client` created in the Quick start example.
response = client.audio.speech.create(
    model="alibaba/cosyvoice2-0.5b",
    voice="alloy",
    input="你好,欢迎使用 TheRouter!",
)
response.write_to_file("output.mp3")
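SiliconFlow models accept at most 1,000 bytes of input, so longer CJK text has to be split before it is sent. The sketch below shows one way to split on UTF-8 byte counts without cutting through a multi-byte character; `split_by_utf8_bytes` is a hypothetical helper and is not provided by the API or the SDK.

```python
def split_by_utf8_bytes(text: str, max_bytes: int = 1000) -> list[str]:
    """Split `text` into pieces of at most `max_bytes` UTF-8 bytes each,
    never cutting through a multi-byte character."""
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("max_bytes is smaller than a single character")
        # Start a new chunk when adding this character would exceed the limit.
        if current and current_size + size > max_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(ch)
        current_size += size
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    pieces = split_by_utf8_bytes("你" * 700)
    print([len(p.encode("utf-8")) for p in pieces])
```

This splits purely by size. A production version would likely prefer sentence or punctuation boundaries so that each audio segment starts and ends naturally.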
Supported models
| Model | Price | Description |
|---|---|---|
| openai/tts-1 | $15/1M chars | OpenAI TTS-1. Low-latency, optimised for speed. Automatic fallback to CosyVoice2 (SiliconFlow) at priority 1. |
| openai/tts-1-hd | $30/1M chars | OpenAI TTS-1 HD. Higher fidelity, slower generation. No SiliconFlow fallback. |
| openai/gpt-4o-mini-tts | $12/1M tokens | GPT-4o Mini TTS. Supports natural-language instructions via the instructions parameter. |
| alibaba/cosyvoice2-0.5b | $7.15/1M bytes | SiliconFlow CosyVoice2 0.5B. CJK-optimised, byte-based billing. Also serves as fallback for openai/tts-1. |
| fishaudio/fish-speech-1.5 | $7.15/1M bytes | Fish Audio Fish-Speech 1.5. Expressive multilingual TTS with voice cloning support. |
| indexteam/indextts-2 | $7.15/1M bytes | IndexTTS-2. Cloning-grade voice fidelity. |
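Since the models above are billed in different units, a rough cost estimate needs to count characters for some and UTF-8 bytes for others. The following sketch copies the prices from the table; `estimate_cost_usd` is a hypothetical helper, and openai/gpt-4o-mini-tts is left out because its token count cannot be derived from text length alone.

```python
# Prices copied from the table above: (USD per 1M units, billing unit).
PRICES = {
    "openai/tts-1": (15.00, "chars"),
    "openai/tts-1-hd": (30.00, "chars"),
    "alibaba/cosyvoice2-0.5b": (7.15, "bytes"),
    "fishaudio/fish-speech-1.5": (7.15, "bytes"),
    "indexteam/indextts-2": (7.15, "bytes"),
}


def estimate_cost_usd(model: str, text: str) -> float:
    """Estimate the cost of synthesising `text` with `model`, in USD."""
    price_per_million, unit = PRICES[model]  # KeyError for unlisted models
    if unit == "chars":
        units = len(text)  # code points, used here as "characters"
    else:
        units = len(text.encode("utf-8"))
    return units * price_per_million / 1_000_000


if __name__ == "__main__":
    text = "Hello from TheRouter.ai!"
    for model in PRICES:
        print(f"{model}: ${estimate_cost_usd(model, text):.6f}")
```

For pure ASCII text the byte and character counts coincide, so the comparison between models reduces to the per-unit price. For CJK text the byte count is roughly three times the character count.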
The sync endpoint returns audio inline. For long-form or background TTS, use POST /v1/jobs with an audio model: the job is queued, the output is stored in S3, and a presigned URL is returned when complete.