Speech-to-Text

Transcribe audio to text. Drop-in replacement for OpenAI's POST /v1/audio/transcriptions endpoint, with automatic fallback between OpenAI Whisper and SiliconFlow SenseVoice.

POST /v1/audio/transcriptions

Content type

multipart/form-data. The audio file is sent as a form field named file.

Form fields

Name	Type	Required	Description
model	string	Required	Standard model alias. Must declare transcription capability. See supported models below.
file	file	Required	Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Max 25 MB.
language	string		ISO 639-1 language code (e.g. en, zh, fr). Auto-detected when omitted. Providing the language improves accuracy and reduces latency.
prompt	string		Optional context text to guide the model. Useful for proper nouns or domain-specific terminology.
response_format	string		Output format: json (default), text, srt, verbose_json, vtt.
temperature	number		Sampling temperature between 0 and 1. Higher values increase variation. Default 0.
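
The file constraints in the table can be checked client-side before uploading. The sketch below is a minimal pre-flight check and not part of the API: check_audio_file is a hypothetical helper, it assumes the file extension reflects the actual format, and it reads "25 MB" as 25 × 1024 × 1024 bytes (the gateway's exact byte limit is not stated on this page).

```python
import os

# Formats listed in the form-field table above.
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
# Assumption: "25 MB" is read as 25 MiB; the gateway may count bytes differently.
MAX_BYTES = 25 * 1024 * 1024


def check_audio_file(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")
    return problems
```

Calling check_audio_file("audio.mp3") before the request lets a client reject an oversized or unsupported file locally instead of waiting for the upload to fail.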

Response

For response_format=json (default), returns a JSON object with a text field containing the transcript and a duration field (seconds, used for billing). Status 200 on success.
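
For clients that call the endpoint without the SDK, a default json response can be handled roughly as follows. The sample body is illustrative only: its values are made up and real responses may carry additional fields. Only the text and duration field names come from the description above; Transcription and parse_transcription are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str
    duration: Optional[float]  # seconds; None if the field is missing


def parse_transcription(body: str) -> Transcription:
    """Parse a response_format=json body into a Transcription."""
    data = json.loads(body)
    duration = data.get("duration")
    return Transcription(
        text=data["text"],
        duration=float(duration) if duration is not None else None,
    )


# Illustrative payload, not captured from the live API.
sample = '{"text": "Hello world.", "duration": 1.8}'
result = parse_transcription(sample)
print(result.text, result.duration)
```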

Quick start

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.therouter.ai/v1",  # the SDK appends /audio/transcriptions to this
    api_key="YOUR_API_KEY",
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-1",
        file=f,
    )

print(transcript.text)

Using SenseVoice for Chinese/English

For Chinese or multilingual audio, alibaba/sensevoice-small (SiliconFlow) delivers emotion-aware transcription with strong Chinese support. It is also the automatic fallback when openai/whisper-1 is unavailable.

# SiliconFlow SenseVoice: emotion-aware, optimised for Chinese/English.
# Reuses the `client` created in the quick start above.
with open("audio.mp3", "rb") as f:  # context manager closes the file after upload
    transcript = client.audio.transcriptions.create(
        model="alibaba/sensevoice-small",
        file=f,
        language="zh",          # optional; auto-detected if omitted
        response_format="json",
    )
print(transcript.text)

Supported models

Model	Price	Description
openai/whisper-1	$0.006/min	OpenAI Whisper-1. General-purpose multilingual transcription. Automatic fallback to SenseVoice (SiliconFlow) at priority 1.
alibaba/sensevoice-small	$0.006/min	SiliconFlow SenseVoice Small. Emotion-aware, strong Chinese/English accuracy. Also serves as fallback for openai/whisper-1.

Billing is per minute of audio duration. The gateway reads the duration field from the upstream JSON response; if the field is absent (e.g. raw text format), the request is not billed.
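
As a rough illustration of the billing rule, the sketch below estimates the charge from the duration field. It assumes the $0.006/min rate from the table is pro-rated linearly by the second, with no rounding or minimum charge, which this page does not specify. estimate_cost is a hypothetical helper, not a gateway API.

```python
from decimal import Decimal
from typing import Optional

RATE_PER_MINUTE = Decimal("0.006")  # USD, from the supported-models table


def estimate_cost(duration_seconds: Optional[float]) -> Decimal:
    """Estimate the charge in USD for one request.

    A missing duration means the request is not billed, so the estimate is 0.
    Assumption: linear pro-rating by the second, no rounding or minimum.
    """
    if duration_seconds is None:
        return Decimal("0")
    return RATE_PER_MINUTE * Decimal(str(duration_seconds)) / Decimal(60)
```

Under these assumptions, a 10-minute file (600 seconds) would be estimated at $0.06, and a response without a duration field at $0.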
