Speech-to-Text
Transcribe audio to text. Drop-in replacement for POST /v1/audio/transcriptions with automatic fallback across OpenAI Whisper and SiliconFlow SenseVoice.
POST/v1/audio/transcriptions
Content type
multipart/form-data. The audio file is sent as a form field named file.
Form fields
| Name | Type | Required | Description |
|---|---|---|---|
model | string | Required | Standard model alias. Must declare transcription capability. See supported models below. |
file | file | Required | Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Max 25 MB. |
language | string | ISO-639-1 language code (e.g. en, zh, fr). Auto-detected when omitted. Providing the language improves accuracy and latency. | |
prompt | string | Optional context text to guide the model. Useful for proper nouns or domain-specific terminology. | |
response_format | string | Output format: json (default), text, srt, verbose_json, vtt. | |
temperature | number | Sampling temperature between 0 and 1. Higher values increase variation. Default 0. |
Response
For response_format=json (default), returns a JSON object with a text field containing the transcript and a duration field (seconds, used for billing). Status 200 on success.
Quick start
Python
from openai import OpenAI
client = OpenAI(
base_url="https://api.therouter.ai",
api_key="YOUR_API_KEY",
)
with open("audio.mp3", "rb") as f:
transcript = client.audio.transcriptions.create(
model="openai/whisper-1",
file=f,
)
print(transcript.text)
Using SenseVoice for Chinese/English
For Chinese or multilingual audio, alibaba/sensevoice-small (SiliconFlow) delivers emotion-aware transcription with strong Chinese support. It is also the automatic fallback when openai/whisper-1 is unavailable.
python
# SiliconFlow SenseVoice β emotion-aware, optimised for Chinese/English
transcript = client.audio.transcriptions.create(
model="alibaba/sensevoice-small",
file=open("audio.mp3", "rb"),
language="zh", # optional β auto-detected if omitted
response_format="json",
)
print(transcript.text)
Supported models
| Name | Type | Required | Description |
|---|---|---|---|
openai/whisper-1 | $0.006/min | OpenAI Whisper-1. General-purpose multilingual transcription. Automatic fallback to SenseVoice (SiliconFlow) at priority 1. | |
alibaba/sensevoice-small | $0.006/min | SiliconFlow SenseVoice Small. Emotion-aware, strong Chinese/English accuracy. Also serves as fallback for openai/whisper-1. |
Billing is per minute of audio duration. The gateway reads the
duration field from the upstream JSON response; if the field is absent (e.g. raw text format), the request is not billed.